# RealFusion

Last updated: March 1, 2023 (morning)

### 360° Reconstruction of Any Object from a Single Image

Oxford University

2023.2.23

## Demo

https://lukemelas.github.io/realfusion/

## Motivation: Single-View 3D Reconstruction

- Reconstructing the 3D structure of an object from a single 2D view is a fundamental challenge in computer vision.
- In the case of a single view, the reconstruction problem is highly ill-posed. As a result, the task requires semantic understanding obtained by learning. Despite the difficulty of this task, humans are adept at using a range of monocular cues to infer the 3D structures of objects from single views.

## Background

### Category-level 3D Reconstruction

- Most prior work tackles the problem of category-specific single-view 3D reconstruction by training a category-level reconstruction model.
- This work: going beyond category-level 3D reconstruction
- It aims to generalize from category-specific images to images of arbitrary objects. This setting is highly challenging, yet humans perform the task effortlessly when observing new objects.

### Single-View 3D Reconstruction

- Arbitrary-object 3D reconstruction has been challenging because the problem fundamentally requires the use of large-scale 3D priors over object shapes, which have not been available.
- With the recent rise of large-scale pretraining, this problem has become tractable. Examples include:
  - Contrastive: CLIP
  - Autoregressive: DALL-E / Parti
  - Diffusion models: DALL-E 2 / Imagen / Stable Diffusion

- These pretrained models can be used as priors for a variety of vision tasks; here we are particularly interested in 3D reconstruction.
- At a high level, one can think of these models as tools for optimizing the realism of an input image.

- In this way, they enable an elegant approach to 3D generation and reconstruction: using these large-scale pretrained models to enforce that a differentiable scene looks realistic from random views.
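As a toy numerical sketch of this idea (every name and the "realism" score below are illustrative stand-ins, not the paper's actual models), one can treat the scene as differentiable parameters, render it from random views, and follow the gradient of a frozen realism score on each render:

```python
import numpy as np

rng = np.random.default_rng(0)

scene = rng.normal(size=8)        # stand-in for NeRF parameters
target = np.ones(8)               # what the toy prior considers "realistic"

def render(scene, view):
    # stand-in differentiable renderer: a per-element view-dependent scaling
    return view * scene

def realism_grad(scene, view):
    # gradient w.r.t. the scene of the toy score -||render(scene, view) - view * target||^2
    return -2.0 * view * (render(scene, view) - view * target)

lr = 0.05
for _ in range(500):
    view = rng.uniform(0.5, 1.5, size=8)     # random camera, here just a scaling
    scene += lr * realism_grad(scene, view)  # gradient ascent on realism
```

In the real method, the toy realism gradient is replaced by the gradient supplied by a pretrained diffusion model, so the scene improves under the prior without the prior ever being differentiated through a closed-form score.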

## Proposal

We propose:

- **RealFusion**, a method that extracts a 360° photographic 3D reconstruction from a single image of an object, without assumptions on the type of object imaged or 3D supervision of any kind.
- We do so by leveraging an existing 2D **diffusion image generator** via a new single-image variant of textual inversion.
- We also introduce new regularizers and provide an efficient implementation using **InstantNGP**.
- We demonstrate **state-of-the-art** reconstruction results on a number of in-the-wild images and images from existing datasets, compared to alternative approaches.

## Related Work

- Image-based reconstruction of appearance and geometry
- Few-view reconstruction
- Single-view reconstruction

- Extracting 3D models from 2D generators

- Diffusion Models

## Method

- The following approach forms the backbone of our method, RealFusion:

- [Init] We are given a single image and a function \(\boldsymbol{p}_{\text {prior }}(\cdot)\) which computes the likelihood of an input image \(\boldsymbol{I}\). We choose a camera view and represent our scene with a differentiably-renderable representation \(\boldsymbol{x}\), for example a NeRF.
- [Reconstruction] We render \(\boldsymbol{x}\) from our given view and minimize the loss with respect to the real input image \(\mathbf{I}\).
- [Prior] We render images \(\boldsymbol{I}_{\text {prior }}\) of \(\boldsymbol{x}\) from randomly-chosen views on a hemisphere surrounding the origin, and we maximize \(\boldsymbol{p}_{\text {prior }}\left(\boldsymbol{I}_{\text {prior }}\right)\) to enforce that \(\boldsymbol{x}\) looks realistic from all directions.
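The three steps above can be sketched as one optimization loop. This is a toy stand-in with assumed names (a linear "renderer", a hand-picked generic prior target), not the actual NeRF-plus-diffusion implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

scene = np.zeros(4)                           # [Init] stand-in for NeRF parameters
input_view = np.ones(4)                       # the given camera
input_image = np.array([0.2, 0.4, 0.6, 0.8])  # the single given photo
lam = 0.1                                     # weight of the prior objective

def render(scene, view):
    return view * scene                       # toy differentiable renderer

lr = 0.02
for _ in range(2000):
    # [Reconstruction] gradient of ||render(scene, input_view) - input_image||^2
    rec_grad = 2.0 * input_view * (render(scene, input_view) - input_image)
    # [Prior] from a random view, pull the render toward a generic "realistic" one
    view = rng.uniform(0.5, 1.5, size=4)
    prior_grad = 2.0 * view * (render(scene, view) - 0.5 * view)
    scene -= lr * (rec_grad + lam * prior_grad)
```

The two gradients pull in different directions; the reconstruction term anchors the input view while the prior term shapes all the others, which is exactly the tension the rest of this section addresses.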

- Prior work has explored this question in the domain of 3D generation
- Dreamfields: CLIP prior
- DreamFusion: Diffusion model prior

- In our work, we adopt a diffusion model prior using Stable Diffusion, a text-conditional latent diffusion model.
- As currently stated, our setup combines a reconstruction objective with a latent-diffusion-based prior objective, conditioned on a manual text prompt (e.g., "An image of a fish").
- However, we found that these results were lacking.
- In particular, the 3D shapes that are generated look like the input object from the input view, but do not look like the input object from other views.
- To fix this, we need to modify the prior to place a high likelihood on our input object, rather than a generic object with the same description.
- We do so by performing textual inversion.
- We optimize a text embedding \(\mathbf{e}\) in the text encoder of the diffusion model to match our input image.
- Usually textual inversion is performed with multiple views of an object, but we substitute these views with heavy image augmentations.
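A minimal sketch of this single-image textual inversion. The feature extractor, the augmentations, and the plain squared-error objective are all toy stand-ins (assumptions for illustration, not the paper's code); the point is only that many augmented copies of one image substitute for multiple views when fitting the embedding:

```python
import numpy as np

rng = np.random.default_rng(0)

input_image = rng.uniform(size=16)   # the single given photo, flattened

def augment(img):
    # stand-in for heavy augmentations (random flip plus noise jitter)
    out = img[::-1].copy() if rng.random() < 0.5 else img.copy()
    return out + rng.normal(scale=0.05, size=img.shape)

def image_features(img):
    # frozen feature extractor of the generator (toy: identity)
    return img

e = np.zeros(16)                     # the learnable embedding for the new token
for t in range(1000):
    target = image_features(augment(input_image))
    step = 1.0 / (t + 1)             # decaying step size -> running average
    e -= step * (e - target)         # gradient of 0.5 * ||e - target||^2
```

In the real method the objective is the diffusion model's denoising loss with the prompt containing the new token, but the structure is the same: one image, many augmentations, one optimized embedding.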

- We also add other pieces of regularization:
- A regularization on rendered normals
- A coarse-to-fine training setup
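One plausible form of the normal regularizer, sketched on toy data. This exact finite-difference penalty is an assumption for illustration, not necessarily the paper's term; the intent it captures is that nearby surface normals should agree, discouraging noisy, non-solid geometry:

```python
import numpy as np

def normal_smoothness_loss(normals):
    # normals: (H, W, 3) normal map rendered from the scene
    dx = normals[:, 1:] - normals[:, :-1]   # horizontal neighbor differences
    dy = normals[1:, :] - normals[:-1, :]   # vertical neighbor differences
    return float((dx ** 2).sum() + (dy ** 2).sum())

rng = np.random.default_rng(0)
flat = np.zeros((8, 8, 3))
flat[..., 2] = 1.0                          # smooth map: all normals face the camera
noisy = flat + rng.normal(scale=0.2, size=flat.shape)
```

A smooth normal map incurs zero penalty, while a noisy one is penalized, so the gradient pushes the geometry toward locally consistent surfaces.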

- However, the key piece of the puzzle is the textual inversion.

## Experiment

## Limitations

- Requires per-image optimization
- Both the textual inversion and the 3D optimization procedure must be performed separately for each input image.
- As a result, the process is relatively slow and difficult to apply to large datasets.

- In some cases, reconstruction fails to produce a solid shape
- Perhaps this could be alleviated with better inductive biases or regularization terms

- In some cases, reconstruction produces two-headed objects
- This is known as the Janus Problem