Novel view synthesis is a hot topic in computer graphics and vision, with applications such as virtual and augmented reality, immersive photography, and the development of digital replicas. The goal is to generate additional views of an object or a scene from a limited set of initial viewpoints. This task is especially demanding because the newly synthesized views must account for occluded areas and previously unseen regions.
Recently, neural radiance fields (NeRF) have achieved exceptional results in generating high-quality novel views. However, NeRF relies on a large number of images, ranging from tens to hundreds, to capture a scene effectively, which makes it prone to overfitting and unable to generalize to new scenes.
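As background on how a NeRF turns its learned fields into pixels, the standard NeRF formulation renders each ray by alpha-compositing the densities σ_i and colors c_i predicted at sampled points: Ĉ = Σ_i T_i (1 − exp(−σ_i δ_i)) c_i, with transmittance T_i = exp(−Σ_{j<i} σ_j δ_j) and δ_i the spacing between samples. The minimal sketch below implements that quadrature for a single ray; the function name `render_ray` and the sample values are illustrative assumptions, not NerfDiff's renderer.

```python
import math

def render_ray(sigmas, colors, deltas):
    """Composite per-sample densities and RGB colors along one ray.

    sigmas: volume densities sigma_i >= 0 at each sample point
    colors: RGB tuples c_i at each sample point
    deltas: distances delta_i between adjacent samples
    Returns (rgb, accumulated_opacity).
    """
    transmittance = 1.0  # T_i: fraction of light that reaches sample i unoccluded
    rgb = [0.0, 0.0, 0.0]
    acc = 0.0
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha           # w_i = T_i * alpha_i
        for k in range(3):
            rgb[k] += weight * color[k]
        acc += weight
        transmittance *= 1.0 - alpha             # T_{i+1} = T_i * exp(-sigma_i * delta_i)
    return tuple(rgb), acc

# A ray that crosses empty space (blue, zero density) and then hits a dense red surface.
rgb, opacity = render_ray([0.0, 50.0], [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)], [0.1, 0.1])
```

Because the first sample has zero density, it contributes nothing and the pixel comes out almost pure red, which is how empty space stays transparent in a NeRF.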
Previous work has introduced generalizable NeRF models that condition the NeRF representation on image features extracted at the projections of 3D points onto the input image. These approaches yield satisfactory results, particularly for views close to the input image. However, when the target views differ significantly from the input, these methods produce blurry results. The challenge lies in resolving the uncertainty associated with large unseen regions in the novel views.
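The conditioning described in the previous paragraph can be sketched in a few lines under common assumptions: project a 3D query point into the input camera with a pinhole model (OpenCV-style camera coordinates, z pointing forward), then bilinearly sample the image feature map at that pixel. The helper names, the nested-list feature map, and the intrinsics `fx, fy, cx, cy` are hypothetical and chosen for illustration; this is not code from NerfDiff or from any specific generalizable NeRF.

```python
def project_point(point, fx, fy, cx, cy):
    """Project a camera-space 3D point (X, Y, Z) to pixel coordinates (u, v) with a pinhole model."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point is behind the camera")
    return fx * x / z + cx, fy * y / z + cy

def bilinear_sample(feature_map, u, v):
    """Bilinearly interpolate an H x W x C feature map (nested lists) at continuous (u, v).

    u indexes columns, v indexes rows; coordinates are clamped to the map borders.
    """
    h, w = len(feature_map), len(feature_map[0])
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)
    u0, v0 = int(u), int(v)
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    du, dv = u - u0, v - v0
    out = []
    for k in range(len(feature_map[0][0])):
        top = (1 - du) * feature_map[v0][u0][k] + du * feature_map[v0][u1][k]
        bottom = (1 - du) * feature_map[v1][u0][k] + du * feature_map[v1][u1][k]
        out.append((1 - dv) * top + dv * bottom)
    return out

def pixel_aligned_feature(point, feature_map, fx, fy, cx, cy):
    """Image feature that a generalizable NeRF would receive for this 3D point."""
    u, v = project_point(point, fx, fy, cx, cy)
    return bilinear_sample(feature_map, u, v)

# 2 x 2 single-channel feature map; the point projects to the center of the four pixels.
features = [[[0.0], [1.0]], [[2.0], [3.0]]]
feat = pixel_aligned_feature((0.0, 0.0, 1.0), features, 1.0, 1.0, 0.5, 0.5)
```

Points far from the input view's visible surfaces still land on some pixel and receive a feature, but that feature says little about what is actually there, which is one intuition for the blurriness mentioned above.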
An alternative way to tackle the uncertainty problem in single-image view synthesis is to use 2D generative models that predict novel views conditioned on the input view. However, the drawback of these methods is that the generated images are not consistent with the underlying 3D structure.
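To illustrate how such a conditional 2D generator produces a view, the sketch below runs deterministic DDIM sampling (one common diffusion sampler) in which every call to the noise predictor receives the input view. The linear beta schedule, the function names, and especially the "oracle" noise predictor are assumptions for illustration: a real model would use a trained network, and nothing in this loop ties the sampled view to a 3D structure, which is exactly the weakness the paragraph describes.

```python
import math
import random

def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s) for a DDPM-style linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def ddim_sample(eps_model, cond, size, alpha_bars, num_sample_steps=50, seed=0):
    """Deterministic DDIM sampling (eta = 0) of a target view conditioned on `cond`.

    eps_model(x_t, t, cond) returns the predicted noise for x_t (hypothetical interface).
    num_sample_steps must be >= 2.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]  # x_T: pure Gaussian noise
    last = len(alpha_bars) - 1
    # Evenly spaced timesteps, from the noisiest (last) down to 0
    steps = [round(last * (1 - i / (num_sample_steps - 1))) for i in range(num_sample_steps)]
    x0 = x
    for i, t in enumerate(steps):
        ab_t = alpha_bars[t]
        eps = eps_model(x, t, cond)  # the conditioning view enters every denoising call
        # Clean-image estimate implied by the current noise prediction
        x0 = [(xi - math.sqrt(1 - ab_t) * ei) / math.sqrt(ab_t) for xi, ei in zip(x, eps)]
        if i + 1 < len(steps):
            ab_prev = alpha_bars[steps[i + 1]]
            # Deterministic DDIM update toward the next, less noisy timestep
            x = [math.sqrt(ab_prev) * x0i + math.sqrt(1 - ab_prev) * ei
                 for x0i, ei in zip(x0, eps)]
    return x0

# Toy usage: a hypothetical oracle that "knows" the novel view is the input view mirrored
# left-to-right; a trained network would have to learn such a mapping from data.
alpha_bars = linear_alpha_bars()
input_view = [0.2, -0.5, 0.9, 0.0]

def oracle_eps(x_t, t, cond):
    target = cond[::-1]
    ab = alpha_bars[t]
    return [(xi - math.sqrt(ab) * yi) / math.sqrt(1 - ab) for xi, yi in zip(x_t, target)]

novel_view = ddim_sample(oracle_eps, input_view, len(input_view), alpha_bars)
```

Each novel view is sampled independently from its own noise, so two views generated this way share only the conditioning image, not an explicit 3D scene.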
To this end, a new technique called NerfDiff has been proposed. NerfDiff is a framework for synthesizing high-quality, multi-view consistent images from a single input view. An overview of the workflow is presented in the figure below.
The proposed approach consists of two stages: training and finetuning.
During the training stage, a camera-space triplane-based NeRF model and a 3D-aware conditional diffusion model (CDM) are jointly trained on a collection of scenes. In the finetuning stage, the NeRF representation is initialized from the input image. Then, the parameters of the NeRF model are adjusted based on a set of virtual images generated by the CDM, which is conditioned on the NeRF-rendered outputs. However, a naive finetuning strategy that optimizes the NeRF parameters directly on the CDM outputs produces low-quality renderings, because the CDM outputs are not multi-view consistent. To address this issue, the researchers propose NeRF-guided distillation, an alternating process that updates the NeRF representation and guides the multi-view diffusion process. In this way, the approach resolves the uncertainty in single-image view synthesis by leveraging the additional information provided by the CDM. At the same time, the NeRF model guides the CDM to ensure multi-view consistency during the diffusion process.
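The alternating structure of NeRF-guided distillation can be illustrated with a deliberately tiny, hypothetical toy. Here the "NeRF" is just a 1D row of cells, each virtual camera sees an overlapping window of it, and the "CDM" is a stand-in that returns plausible content plus independent per-view noise, so raw samples of overlapping views disagree; it is handed the ground-truth scene as a proxy for what a trained diffusion model has learned. Each round renders the current NeRF, blends every CDM sample toward its NeRF rendering (an assumed, simplified form of guidance), and refits the NeRF to the resulting virtual images. All names, the blending rule, and the noise model are assumptions; the sketch mirrors only the alternation, not the paper's actual guidance or losses.

```python
import random

def render(scene, view, width):
    """Toy 'NeRF' rendering: the camera at offset `view` sees `width` cells of a 1D scene."""
    return scene[view:view + width]

def toy_cdm_sample(prior_window, nerf_window, guidance, rng, noise=0.5):
    """Hypothetical CDM stand-in: plausible content plus independent per-view noise,
    blended toward the NeRF rendering (guidance=0 ignores the NeRF, 1 copies it)."""
    raw = [p + rng.gauss(0.0, noise) for p in prior_window]
    return [(1 - guidance) * r + guidance * n for r, n in zip(raw, nerf_window)]

def nerf_update(scene, views, virtual_images):
    """Refit the toy NeRF: each cell becomes the mean of all virtual pixels covering it,
    the exact minimizer of the squared error (standing in for gradient-based finetuning)."""
    sums, counts = [0.0] * len(scene), [0] * len(scene)
    for view, image in zip(views, virtual_images):
        for i, value in enumerate(image):
            sums[view + i] += value
            counts[view + i] += 1
    # Cells seen by no virtual view keep their previous value
    return [s / c if c else old for s, c, old in zip(sums, counts, scene)]

def nerf_guided_distillation(scene, prior, views, width, guidance=0.8, rounds=60, seed=0):
    """Alternate (1) NeRF-guided CDM sampling and (2) NeRF updates on the virtual images."""
    rng = random.Random(seed)
    for _ in range(rounds):
        renders = [render(scene, v, width) for v in views]
        virtual = [toy_cdm_sample(render(prior, v, width), r, guidance, rng)
                   for v, r in zip(views, renders)]
        scene = nerf_update(scene, views, virtual)
    return scene

truth = [0.1 * i for i in range(12)]              # hypothetical ground-truth 1D scene
views, width = [0, 2, 4], 8                       # three overlapping virtual cameras
init = truth[:width] + [0.0] * (len(truth) - width)  # only the input view's cells are known
result = nerf_guided_distillation(init, truth, views, width)
max_err = max(abs(a - b) for a, b in zip(result, truth))
```

Setting `guidance=0.0` turns each round into a fit to fresh, mutually inconsistent samples, which loosely resembles the naive strategy described above; with `guidance > 0` the toy NeRF estimate behaves like a running average that damps the per-view noise.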
Some of the results obtained with NerfDiff are reported below (where NGD stands for NeRF-Guided Distillation).
This was a summary of NerfDiff, a novel AI framework that enables high-quality, consistent multi-view synthesis from a single input image. If you are interested, you can learn more about this method in the links below.
Check out the Paper and Project.
Daniele Lorenzi received his M.Sc. in ICT for Web and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.