RealFusion
Last updated: March 1, 2023 (morning)
RealFusion:
360° Reconstruction of Any Object from a Single Image
University of Oxford
2023.2.23
Demo
https://lukemelas.github.io/realfusion/
Motivation: Single-View 3D Reconstruction
- Reconstructing the 3D structure of an object from a single 2D view is a fundamental challenge in computer vision.
- In the case of a single view, the reconstruction problem is highly ill-posed. As a result, the task requires semantic understanding obtained by learning. Despite the difficulty of this task, humans are adept at using a range of monocular cues to infer the 3D structures of objects from single views.
Background
Category-level 3D Reconstruction
- Most prior work tackles the problem of category-specific single-view 3D reconstruction by training a category-level reconstruction model.
- This work: going beyond category-level 3D reconstruction
- This work aims to go beyond category-specific images to images of arbitrary objects. This setting is highly challenging, but humans perform it effortlessly when they observe new objects.
Single-View 3D Reconstruction
- Arbitrary-object 3D reconstruction has been challenging because the problem fundamentally requires the use of large-scale 3D priors over object shapes, which have not been available.
- With the recent rise of large-scale pretraining, this problem has become tractable. Examples include:
- Contrastive: CLIP
- Autoregressive: DALL-E / Parti
- Diffusion Models: DALL-E 2 / Imagen / Stable Diffusion
- These pretrained models may be used as priors for a variety of vision tasks, and we are particularly interested in 3D reconstruction.
- At a high level, you can think of these models as a tool for optimizing the realism of an input image.
- In this way, they enable an elegant approach to 3D generation and reconstruction: using these large-scale pretrained models to enforce that a differentiable scene looks realistic from random views.
Proposal
We propose RealFusion, a method that can extract from a single image of an object a 360° photographic 3D reconstruction, without assumptions about the type of object imaged or 3D supervision of any kind;
We do so by leveraging an existing 2D diffusion image generator via a new single-image variant of textual inversion;
We also introduce new regularizers and provide an efficient implementation using InstantNGP;
We demonstrate state-of-the-art reconstruction results on a number of in-the-wild images and images from existing datasets when compared to alternative approaches.
Related Work
- Image-based reconstruction of appearance and geometry
- Few-view reconstruction
- Single-view reconstruction
- Extracting 3D models from 2D generators
- Diffusion Models
Method
- The approach sketched above (using a large-scale 2D prior to enforce that a differentiable scene looks realistic from random views) forms the backbone of our method, RealFusion.
- [Init] We are given a single image \(\boldsymbol{I}\) and a function \(p_{\text{prior}}(\cdot)\) which computes the likelihood of an input image. We choose a camera view and represent our scene with a differentiably-renderable representation \(\boldsymbol{x}\), for example a NeRF.
- [Reconstruction] We render \(\boldsymbol{x}\) from the given view and minimize the loss with respect to the real input image \(\boldsymbol{I}\).
- [Prior] We render images \(\boldsymbol{I}_{\text{prior}}\) of \(\boldsymbol{x}\) from randomly-chosen views on a hemisphere surrounding the origin, and we maximize \(p_{\text{prior}}(\boldsymbol{I}_{\text{prior}})\) to enforce that \(\boldsymbol{x}\) looks realistic from all directions (a minimal sketch of this loop follows below).
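Here is that loop as a minimal, runnable sketch under toy assumptions: `TinyScene`, `render`, and `prior_loss` are illustrative stand-ins for the real components (a NeRF implemented with InstantNGP and a diffusion prior), not the paper's API.

```python
import torch
import torch.nn as nn

class TinyScene(nn.Module):
    """Stand-in for a differentiably-renderable scene representation (e.g. a NeRF)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, coords, view):
        # coords: (N, 2) pixel coordinates; view: (3,) camera direction
        v = view.expand(coords.shape[0], -1)
        return self.net(torch.cat([coords, v], dim=-1))  # (N, 3) RGB

def render(scene, view, res=32):
    """Toy 'renderer': query the scene at every pixel for a given view."""
    ys, xs = torch.meshgrid(torch.linspace(0, 1, res),
                            torch.linspace(0, 1, res), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    return scene(coords, view).reshape(res, res, 3)

def prior_loss(image):
    # Placeholder for the diffusion prior (score distillation in the real
    # method); a smoothness term keeps this sketch runnable end to end.
    return (image[1:] - image[:-1]).pow(2).mean()

scene = TinyScene()
opt = torch.optim.Adam(scene.parameters(), lr=1e-3)
input_view = torch.tensor([0.0, 0.0, 1.0])  # the known camera for the photo
input_image = torch.rand(32, 32, 3)         # stand-in for the single input image

for step in range(100):
    # [Reconstruction] match the input image from the given view
    loss_rec = (render(scene, input_view) - input_image).pow(2).mean()
    # [Prior] enforce realism from a random view on the hemisphere
    random_view = nn.functional.normalize(torch.randn(3), dim=0)
    random_view[2] = random_view[2].abs()   # stay on the upper hemisphere
    loss_prior = prior_loss(render(scene, random_view))
    (loss_rec + loss_prior).backward()
    opt.step(); opt.zero_grad()
```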
- Prior work has explored this question in the domain of 3D generation:
- Dreamfields: CLIP prior
- DreamFusion: Diffusion model prior
- In our work, we adopt a diffusion model prior using Stable Diffusion, a text-conditional latent diffusion model.
- As currently stated, our setup combines a reconstruction objective with a latent-diffusion-based prior objective, conditioned on a manual text prompt (e.g. "An image of a fish."), as written out below.
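Written out (notation borrowed from DreamFusion's score distillation sampling rather than copied from the paper; \(\mathcal{R}\) is the differentiable renderer, \(\pi_0\) the input view, \(\pi\) a random view, \(z_t\) the noised latent of the encoded render, \(e\) the text conditioning, and \(\hat{\epsilon}_\phi\) the frozen denoiser), the combined objective is roughly:

\[
\mathcal{L}(\boldsymbol{x}) = \big\lVert \mathcal{R}(\boldsymbol{x}, \pi_0) - \boldsymbol{I} \big\rVert^2 + \lambda \, \mathcal{L}_{\text{SDS}}\big(\mathcal{R}(\boldsymbol{x}, \pi)\big),
\qquad
\nabla_{\boldsymbol{x}} \mathcal{L}_{\text{SDS}} = \mathbb{E}_{t, \epsilon}\Big[ w(t) \big(\hat{\epsilon}_\phi(z_t; e, t) - \epsilon\big) \frac{\partial z}{\partial \boldsymbol{x}} \Big].
\]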
- However, we found that these results were lacking.
- In particular, the 3D shapes that are generated look like the input object from the input view, but do not look like the input object from other views.
- To fix this, we need to modify the prior to place a high likelihood on our input object, rather than a generic object with the same description.
- We do so by performing textual inversion.
- We optimize a text embedding \(\mathbf{e}\) in the text encoder of the diffusion model to match our input image.
- Usually textual inversion is performed with multiple views of an object, but we substitute these views with heavy image augmentations (see the sketch below).
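A minimal, runnable sketch of this single-image inversion, under toy assumptions: `ToyDenoiser` stands in for Stable Diffusion's frozen UNet, the noise schedule is invented for illustration, and only the embedding `e` receives gradients.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Heavy augmentations substitute for the multiple views of standard textual inversion.
augment = transforms.Compose([
    transforms.RandomResizedCrop(64, scale=(0.5, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2, 0.2),
])

class ToyDenoiser(nn.Module):
    """Stand-in for the frozen text-conditional diffusion UNet (t is unused here)."""
    def __init__(self, emb_dim=16):
        super().__init__()
        self.img = nn.Conv2d(3, 3, 3, padding=1)
        self.txt = nn.Linear(emb_dim, 3)

    def forward(self, noisy, t, e):
        cond = self.txt(e).view(1, 3, 1, 1)
        return self.img(noisy) + cond  # predicted noise

denoiser = ToyDenoiser().requires_grad_(False)  # the prior stays frozen
e = torch.zeros(16, requires_grad=True)         # the learned token embedding <e>
opt = torch.optim.Adam([e], lr=1e-2)
image = torch.rand(3, 64, 64)                   # the single input image

for step in range(200):
    x = augment(image).unsqueeze(0)             # one augmented "view"
    t = torch.randint(0, 1000, (1,))
    noise = torch.randn_like(x)
    alpha = 1.0 - t.float() / 1000              # toy noise schedule
    noisy = alpha.sqrt() * x + (1 - alpha).sqrt() * noise
    loss = (denoiser(noisy, t, e) - noise).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```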
- We also add other pieces of regularization:
- A regularization on rendered normals (a sketch appears at the end of this section)
- A coarse-to-fine training setup
- However, the key piece of the puzzle is the textual inversion.
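For the normals regularizer, a hedged sketch of the general idea (the exact form in the paper may differ): penalize large differences between adjacent entries of the rendered normal map so that surfaces stay smooth.

```python
import torch

def normal_smoothness(normals: torch.Tensor) -> torch.Tensor:
    """normals: (H, W, 3) unit-normal map rendered from the scene."""
    dx = (normals[:, 1:] - normals[:, :-1]).pow(2).sum(-1)
    dy = (normals[1:, :] - normals[:-1, :]).pow(2).sum(-1)
    return dx.mean() + dy.mean()

# hypothetical usage alongside the two main terms:
# loss = loss_rec + loss_prior + 0.5 * normal_smoothness(normal_map)
```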
Experiment
Limitations
- Requires per-image optimization
- Both the textual inversion and the 3D optimization procedure must be performed separately for each input image.
- As a result, the process is relatively slow and difficult to apply to large datasets.
- In some cases, reconstruction fails to produce a solid shape
- Perhaps this could be alleviated with better inductive biases or regularization terms
- In some cases, reconstruction produces two-headed objects
- This is known as the Janus Problem