We introduce Free3D, a simple approach designed for open-set novel view synthesis (NVS) from a single image.
Similar to Zero-1-to-3, we start from a pre-trained 2D image generator for generalization, and fine-tune it for NVS. Compared to recent and concurrent works, we obtain significant improvements without resorting to an explicit 3D representation, which is slow and memory-consuming.
We do so by better encoding the target camera pose via a new per-pixel ray conditioning normalization (RCN) layer. The latter injects camera pose information into the underlying 2D image generator by telling each pixel its specific viewing direction. We also improve multi-view consistency via a lightweight multi-view attention layer and multi-view noise sharing. We train Free3D on the Objaverse dataset and demonstrate excellent generalization to new categories in several large datasets, including OmniObject3D and Google Scanned Objects (GSO).
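To make the ray conditioning idea concrete, here is a minimal PyTorch sketch of what a per-pixel ray conditioning normalization (RCN) layer could look like, assuming a FiLM/AdaGN-style design in which a per-pixel ray encoding (e.g. Plücker coordinates) predicts a scale and shift for the normalized latent features. The module and argument names are illustrative assumptions, not the released code.

import torch
import torch.nn as nn

class RayConditioningNorm(nn.Module):
    # Hypothetical RCN layer: modulates group-normalized U-Net features with a
    # per-pixel scale/shift predicted from a per-pixel camera-ray encoding.
    def __init__(self, feat_channels, ray_channels=6, hidden=128):
        super().__init__()
        # feat_channels is assumed divisible by 32, as in typical diffusion U-Nets.
        self.norm = nn.GroupNorm(32, feat_channels, affine=False)
        self.mlp = nn.Sequential(
            nn.Conv2d(ray_channels, hidden, 1),
            nn.SiLU(),
            nn.Conv2d(hidden, 2 * feat_channels, 1),
        )

    def forward(self, x, rays):
        # x:    (B, C, H, W) latent features of the diffusion U-Net
        # rays: (B, ray_channels, H, W) per-pixel encoding of the target camera rays
        scale, shift = self.mlp(rays).chunk(2, dim=1)
        return self.norm(x) * (1 + scale) + shift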
The overall pipeline of Free3D. (a) Given a single source image, the proposed architecture jointly predicts multiple target views instead of processing them independently. To achieve consistent novel view synthesis without a 3D representation, (b) we first propose a novel ray conditioning normalization (RCN) layer, which uses a per-pixel oriented camera ray to modulate the latent features, enabling the model to capture more precise viewpoints. (c) A memory-friendly pseudo-3D cross-attention module then efficiently bridges information across the multiple generated views. Note that the similarity scores are computed only across views (the temporal dimension) rather than across spatial positions, resulting in minimal computational and memory cost.
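The pseudo-3D cross-attention in (c) can be sketched in a similar way. Assuming it behaves like the temporal attention of video diffusion models, the V jointly generated views attend to one another independently at every spatial location, so each attention matrix is only V x V per pixel rather than (V·H·W) x (V·H·W). The class and variable names below are again illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class ViewAxisAttention(nn.Module):
    # Hypothetical pseudo-3D attention: self-attention over the view axis only,
    # with spatial positions folded into the batch to keep memory cost minimal.
    def __init__(self, channels, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, V, C, H, W) latent features of V jointly generated views
        B, V, C, H, W = x.shape
        tokens = x.permute(0, 3, 4, 1, 2).reshape(B * H * W, V, C)
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed)
        tokens = tokens + out  # residual connection over the view axis
        return tokens.reshape(B, H, W, V, C).permute(0, 3, 4, 1, 2)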
Free3D significantly improves the pose accuracy of the generated views compared to existing state-of-the-art methods on various datasets, including Objaverse (top two rows), OmniObject3D (middle two rows) and GSO (bottom two rows).
Using Free3D, you can directly render a consistent 360-degree video without the need for an additional explicit 3D representation or network.
There is a lot of excellent work that was introduced around the same time as ours.
Stable Video Diffusion fine-tunes an image-to-video diffusion model for multi-view generation.
Efficient-3DiM fine-tunes Stable Diffusion with a stronger vision transformer, DINOv2.
Consistent-1-to-3 uses epipolar attention to extract coarse results for the diffusion model.
One-2-3-45 and One-2-3-45++ directly train an additional 3D network on the outputs of a multi-view generator.
MVDream, Consistent123 and Wonder3D also train multi-view diffusion models, yet still require post-processing for video rendering.
Some works, such as SyncDreamer and ConsistNet, incorporate a 3D representation into the latent diffusion model.
Many thanks to Stanislaw Szymanowicz, Edgar Sucar, and Luke Melas-Kyriazi of VGG for insightful discussions and Ruining Li, Eldar Insafutdinov, and Yash Bhalgat of VGG for their helpful feedback. We would also like to thank the authors of Zero-1-to-3 and Objaverse-XL for their helpful discussions.
@article{zheng2023free3D,
author = {Zheng, Chuanxia and Vedaldi, Andrea},
title = {Free3D: Consistent Novel View Synthesis without 3D Representation},
journal = {arXiv},
year = {2023},
}