Last updated: March 1, 2023 (morning)


360° Reconstruction of Any Object from a Single Image

  • Oxford University

  • 2023.2.23



Motivation: Single-View 3D Reconstruction

  • Reconstructing the 3D structure of an object from a single 2D view is a fundamental challenge in computer vision.
  • In the case of a single view, the reconstruction problem is highly ill-posed. As a result, the task requires semantic understanding obtained by learning. Despite the difficulty of this task, humans are adept at using a range of monocular cues to infer the 3D structures of objects from single views.


Category-level 3D Reconstruction

  • Most prior work tackles the problem of category-specific single-view 3D reconstruction by training a category-level reconstruction model.
  • The work: Going beyond category-level 3D reconstruction
    • This work aims to go beyond category-specific images to images of arbitrary objects. This setting is highly challenging, but humans perform it effortlessly when they observe new objects.

Single-View 3D Reconstruction

  • Arbitrary-object 3D reconstruction has been challenging because the problem fundamentally requires the use of large-scale 3D priors over object shapes, which have not been available.
  • With the recent rise of large-scale pretraining, this problem has become tractable. Examples include:
    • Contrastive: CLIP
    • Autoregressive: DALL-E / Parti
    • Diffusion Models: DALL-E 2 / Imagen / Stable Diffusion
  • These pretrained models may be used as priors for a variety of vision tasks, and we are particularly interested in 3D reconstruction.
    • At a high level, you can think of these models as a tool for optimizing the realism of an input image.
  • In this way, they enable an elegant approach to 3D generation and reconstruction: using these large-scale pretrained models to enforce that a differentiable scene looks realistic from random views.
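The "optimize realism" idea above can be made concrete with a toy sketch. Here an isotropic Gaussian density stands in for the pretrained image prior, and the "scene" is just the image itself with an identity renderer; all names are illustrative, not the paper's API. Gradient ascent on the prior's log-likelihood of the render pushes the scene toward images the prior considers realistic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pretrained prior: an isotropic Gaussian over 8x8
# "images" centered on a realistic mean image. A real system would use
# CLIP or a diffusion model here.
mean_image = rng.normal(size=(8, 8))

def log_prior_grad(image):
    """Gradient of log N(image; mean_image, I) with respect to the image."""
    return mean_image - image

# Toy differentiable "scene": its parameters ARE the image; render = identity.
scene = np.zeros((8, 8))
lr = 0.1
for _ in range(200):
    rendered = scene                               # render(scene) from a view
    scene = scene + lr * log_prior_grad(rendered)  # ascend the log-likelihood

# scene now renders to something the toy prior deems maximally realistic.
```

With a diffusion prior, the analogous gradient comes from the model's denoising network (as in DreamFusion's score distillation) rather than from a closed-form density.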


Contributions

  1. We propose RealFusion, a method that can extract a 360° photographic 3D reconstruction from a single image of an object, without assumptions on the type of object imaged or 3D supervision of any kind;

  2. We do so by leveraging an existing 2D diffusion image generator via a new single image variant of textual inversion;

  3. We also introduce new regularizers and provide an efficient implementation using InstantNGP;

  4. We demonstrate state-of-the-art reconstruction results on a number of in-the-wild images and images from existing datasets when compared to alternative approaches.

Related Work

  • Image-based reconstruction of appearance and geometry
  • Few-view reconstruction
  • Single-view reconstruction
  • Extracting 3D models from 2D generators
  • Diffusion Models


Method

  • This approach forms the backbone of our method, RealFusion.
  1. [Init] We are given a single image and a function \(\boldsymbol{p}_{\text {prior }}(\cdot)\) which computes the likelihood of an input image \(\boldsymbol{I}\). We choose a camera view and represent our scene with a differentiably-renderable representation \(\boldsymbol{x}\), for example a NeRF.
  2. [Reconstruction] We render \(\boldsymbol{x}\) from the given view and minimize a reconstruction loss (e.g. an L2 photometric loss) against the real input image \(\boldsymbol{I}\).
  3. [Prior] We render images \(\boldsymbol{I}_{\text {prior }}\) of \(\boldsymbol{x}\) from randomly-chosen views on a hemisphere surrounding the origin, and we optimize \(\boldsymbol{p}_{\text {prior }}\left(\boldsymbol{I}_{\text {prior }}\right)\) to enforce that \(\boldsymbol{x}\) looks realistic from all directions.
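The three steps above amount to a single optimization loop. The toy sketch below uses a flat parameter vector in place of a NeRF, a fixed random linear map per view in place of volume rendering, and a "realistic render" target in place of the diffusion prior; every name here is an illustrative stand-in, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# [Init] Toy "scene": a parameter vector standing in for NeRF weights.
scene = np.zeros(16)
target = rng.normal(size=16)              # ground-truth scene (demo only)

def render(scene, view):
    """Toy differentiable renderer: one linear map per camera view."""
    return view @ scene

input_view = rng.normal(size=(8, 16))
input_image = render(target, input_view)  # the single given photo

lr = 0.01
for step in range(2000):
    # [Reconstruction] match the input image from the given view.
    residual = render(scene, input_view) - input_image
    grad_rec = input_view.T @ residual

    # [Prior] render from a random view; the toy "prior" pulls that render
    # toward a plausible one (a real system would use a diffusion model).
    prior_view = rng.normal(size=(8, 16))
    prior_residual = render(scene, prior_view) - render(target, prior_view)
    grad_prior = prior_view.T @ prior_residual

    scene -= lr * (grad_rec + 0.1 * grad_prior)
```

The reconstruction term alone is underdetermined (8 observations for 16 parameters); it is the prior term over random views that pins down the unobserved directions, which mirrors why a single-view method needs a learned prior at all.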
  • Prior work has explored this question in the domain of 3D generation
    • Dreamfields: CLIP prior
    • DreamFusion: Diffusion model prior
  • In our work, we adopt a diffusion model prior using Stable Diffusion, a text-conditional latent diffusion model.
  • As currently stated, our setup combines a reconstruction objective with a latent-diffusion-based prior objective, which is conditioned on a manually written text prompt (e.g. "An image of a fish.")
  • However, we found that these results were lacking.
  • In particular, the 3D shapes that are generated look like the input object from the input view, but do not look like the input object from other views.
  • To fix this, we need to modify the prior to place a high likelihood on our input object, rather than a generic object with the same description.
  • We do so by performing textual inversion.
    • We optimize a text embedding \(\mathbf{e}\) in the text encoder of the diffusion model to match our input image.
    • Usually textual inversion is performed with multiple views of an object, but we substitute these views with heavy image augmentations.
  • We also add other pieces of regularization:
    1. A regularization on rendered normals
    2. A coarse-to-fine training setup
  • However, the key piece of the puzzle is the textual inversion.
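The single-image textual inversion step can be sketched with a toy example. Everything here is an illustrative stand-in: a frozen random linear map replaces Stable Diffusion's frozen text-conditional denoiser, flattened pixels replace its image features, and the "heavy augmentations" are just random flips. The structure is what matters: only the embedding \(\mathbf{e}\) is optimized, so that the frozen model conditioned on \(\mathbf{e}\) reproduces many augmented versions of the one input image.

```python
import numpy as np

rng = np.random.default_rng(2)

input_image = rng.normal(size=(4, 4))     # the single given photo

def augment(img):
    """Stand-in for heavy augmentations: random horizontal/vertical flips."""
    if rng.random() < 0.5:
        img = img[:, ::-1]
    if rng.random() < 0.5:
        img = img[::-1, :]
    return img

def features(img):
    """Stand-in for the model's image features: flattened pixels."""
    return img.reshape(-1)

# Frozen "model": maps a text embedding to predicted image features.
W = rng.normal(size=(16, 8))

# Optimize ONLY the embedding e; W stays frozen, as in textual inversion.
e = np.zeros(8)
lr = 0.01
for _ in range(2000):
    feat = features(augment(input_image))
    grad = W.T @ (W @ e - feat)           # d/de of 0.5 * ||W e - feat||^2
    e -= lr * grad
```

After optimization, \(\mathbf{e}\) is a prompt-like code for *this particular object*, so the prior in step 3 favors views of the input object rather than of a generic object with the same text description.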



Limitations

  • Requires per-image optimization
    • Both the textual inversion and the 3D optimization procedure must be performed separately for each input image.
    • As a result, the process is relatively slow and difficult to apply to large datasets
  • In some cases, reconstruction fails to produce a solid shape
    • Perhaps this could be alleviated with better inductive biases or regularization terms
  • In some cases, reconstruction produces two-headed objects
    • This is known as the Janus Problem