In-N-Out: Lifting 2D Diffusion Prior for 3D Object Removal via Tuning-Free Latents Alignment

1 The University of Melbourne 2 Alibaba 3 Google Research 4 The University of Sydney 5 MBZUAI
NeurIPS, 2024
(Overview figure)


Abstract

Neural representations of 3D scenes have advanced substantially in recent years, yet object removal remains a practical but challenging problem due to the absence of multi-view supervision over occluded areas. Diffusion Models (DMs), trained on extensive collections of 2D images, exhibit diverse and high-fidelity generative capabilities in the 2D domain. However, because they are not trained on 3D data, applying them to multi-view data often exacerbates inconsistency, degrading the overall quality of the 3D output. To address these issues, we introduce ``In-N-Out'', a novel approach that first inpaints a prior, i.e., the occluded area from a single view, using DMs, and then outstretches it to create multi-view inpaintings via latents alignment. Our analysis identifies that the variability in DMs' outputs mainly arises from the initially sampled latents and the intermediate latents predicted during the denoising process. We explicitly align the initial latents using a Neural Radiance Field (NeRF) to establish a consistent foundational structure in the inpainted area, complemented by an implicit alignment of intermediate latents through cross-view attention during the denoising phases, which enhances appearance consistency across views. To further improve rendering results, we apply a patch-based hybrid loss to optimize the NeRF. We demonstrate that our techniques effectively mitigate the inconsistencies introduced by DMs and substantially improve the fidelity and coherence of inpainted 3D representations.

Rendering Results

Original Input

SPIn-NeRF (multi-view)

InFusion (single-view)

Ours

Method

Given a set of multi-view training images {Ii} of a scene, with corresponding masks {Mi} marking the unwanted object in each frame, our approach aims to generate a consistently inpainted training set. These inpainted images are then used to supervise a NeRF. The method comprises three key stages:

Stage 1: Pretrain a NeRF model (φ) using the original images {Ii}, the masks {Mi}, and a sampled inpainted prior Ip, which provides a rough hallucination of the missing content.
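A minimal sketch of the Stage 1 supervision, assuming the common convention that the mask is 1 inside the unwanted object and 0 elsewhere: background pixels are supervised by the original image, while the hole region of the prior view p is supervised by the inpainted prior Ip. The function name and mask convention are illustrative, not the paper's exact implementation.

```python
import numpy as np

def pretrain_loss(rendered, image, mask, prior=None):
    """Masked L2 loss for NeRF pretraining (illustrative sketch).

    rendered, image, prior: (H, W, 3) arrays; mask: (H, W), 1 = unwanted object.
    Pixels outside the mask are supervised by the original image; if `prior`
    (the single inpainted view I_p) is given, masked pixels are supervised by it.
    """
    keep = (1.0 - mask)[..., None]                     # valid background pixels
    loss = np.sum(keep * (rendered - image) ** 2) / max(keep.sum() * 3, 1)
    if prior is not None:                              # prior view only
        hole = mask[..., None]
        loss += np.sum(hole * (rendered - prior) ** 2) / max(hole.sum() * 3, 1)
    return loss
```

For all non-prior views, only the first (background) term applies, which is what leaves the occluded region unconstrained and motivates Stage 2.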

Stage 2: Use the pretrained NeRF (φ) to inpaint the remaining views {Ii | i ≠ p, i = 1, …, N} via explicit and implicit latents alignment (ELA and ILA), conditioned on the inpainting prior Ip.
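The two alignments can be sketched as follows, under stated assumptions: ELA replaces the purely random initial latent of view i with a noised encoding of the pretrained NeRF's render of that view (so every view starts denoising from shared coarse structure), and ILA extends self-attention so that view i also attends to the prior view's keys and values. The `encode` callable and the single `alpha_bar` scalar are stand-ins for a VAE encoder and the diffusion scheduler; this is not the paper's exact layer.

```python
import numpy as np

def explicit_latent_alignment(nerf_render_i, encode, alpha_bar, rng=None):
    """ELA sketch: noise the encoded NeRF render of view i to form its
    initial latent, instead of sampling pure Gaussian noise.
    `encode` is a hypothetical VAE-encoder stand-in; `alpha_bar` plays the
    role of the scheduler's cumulative alpha at the starting timestep."""
    rng = rng or np.random.default_rng(0)
    z0 = encode(nerf_render_i)
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1 - alpha_bar) * rng.standard_normal(z0.shape)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(q_i, kv_i, kv_p):
    """ILA sketch: self-attention for view i with keys/values concatenated
    from the prior view p, so appearance is implicitly pulled toward I_p.
    q_i: (n, d) queries; kv_i, kv_p: (n, d) features of view i and view p."""
    k = np.concatenate([kv_i, kv_p], axis=0)   # (2n, d) shared key bank
    v = k
    attn = softmax(q_i @ k.T / np.sqrt(q_i.shape[-1]))
    return attn @ v
```

Note that both mechanisms are tuning-free: they change only the latents fed to (and routed through) a frozen diffusion model, with no fine-tuning of its weights.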

Stage 3: Using the inpainted image set {Ii}, optimize the NeRF model (φ) with a patch-based hybrid loss to distill multi-view supervision.
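A sketch of the Stage 3 objective, assuming it combines a per-pixel term with a patch-level term. Here simple patch statistics stand in for the perceptual component that a patch-based loss would typically use; the patch size, weight, and statistics are illustrative choices, not the paper's.

```python
import numpy as np

def hybrid_patch_loss(rendered, target, patch=8, w_patch=0.1, rng=None):
    """Patch-based hybrid loss sketch: pixel-wise L2 over the full image
    plus a term computed on a randomly sampled patch. The patch statistics
    (mean/std matching) are a stand-in for a perceptual network, purely
    for illustration."""
    rng = rng or np.random.default_rng(0)
    pix = np.mean((rendered - target) ** 2)
    H, W = rendered.shape[:2]
    y = rng.integers(0, H - patch + 1)
    x = rng.integers(0, W - patch + 1)
    pr = rendered[y:y + patch, x:x + patch]
    pt = target[y:y + patch, x:x + patch]
    patch_term = abs(pr.mean() - pt.mean()) + abs(pr.std() - pt.std())
    return pix + w_patch * patch_term
```

Sampling a fresh patch per iteration keeps the patch term cheap while still exposing the NeRF to structure beyond individual pixels.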

Please refer to our paper for more details.


BibTeX


        @inproceedings{hu2024innout,
          title     = {In-N-Out: Lifting 2D Diffusion Prior for 3D Object Removal via Tuning-Free Latents Alignment},
          author    = {Dongting Hu and Huan Fu and Jiaxian Guo and Liuhua Peng and Tingjin Chu and Feng Liu and Tongliang Liu and Mingming Gong},
          booktitle = {The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS)},
          year      = {2024},
          url       = {https://openreview.net/forum?id=gffaYDu9mM}
        }