AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting

Chung-Ho Wu1  Yang-Jung Chen1  Ying-Huan Chen1  Jie-Ying Lee1
Bo-Hsu Ke1  Chun-Wei Tuan Mu1  Yi-Chuan Huang1  Chin-Yang Lin1
Min-Hung Chen2  Yen-Yu Lin1  Yu-Lun Liu1
1National Yang Ming Chiao Tung University  2NVIDIA

https://kkennethwu.github.io/aurafusion360/
Abstract

Three-dimensional scene inpainting is crucial for applications from virtual reality to architectural visualization, yet existing methods struggle with view consistency and geometric accuracy in 360° unbounded scenes. We present AuraFusion360, a novel reference-based method that enables high-quality object removal and hole filling in 3D scenes represented by Gaussian Splatting. Our approach introduces (1) depth-aware unseen mask generation for accurate occlusion identification, (2) Adaptive Guided Depth Diffusion, a zero-shot method for accurate initial point placement without requiring additional training, and (3) SDEdit-based detail enhancement for multi-view coherence. We also introduce 360-USID, the first comprehensive dataset for 360° unbounded scene inpainting with ground truth. Extensive experiments demonstrate that AuraFusion360 significantly outperforms existing methods, achieving superior perceptual quality while maintaining geometric accuracy across dramatic viewpoint changes.

Figure 1: Overview of our reference-based 360° unbounded scene inpainting method. Given input images with camera parameters, object masks, and a reference image, our AuraFusion360 approach generates an object-masked Gaussian Splatting representation. This representation can then render novel views of the inpainted scene, effectively removing the masked objects while maintaining consistency with the reference image.

1 Introduction

Three-dimensional scene reconstruction, driven by Neural Radiance Fields [34] and 3D Gaussian Splatting [20], is vital for VR/AR, robotics, and autonomous driving. A key challenge is realistic object removal and hole filling, which is essential for augmented reality and real estate visualization. Inpainting 360° unbounded scenes remains difficult due to the need for multi-view consistency, plausible unseen region extrapolation, and geometric coherence across views.

Fig. 1 shows our reference-based 360° unbounded scene inpainting approach. Given input images with camera parameters, object masks, and a reference image, our method generates an inpainted 3D scene using Gaussian Splatting [20, 17] for novel view rendering. We exploit multi-view information and generative models to fill unseen areas, ensuring coherent and plausible results across views. Integrating Gaussian Splatting’s explicit representation with 2D generative inpainting, our method maintains multi-view consistency and geometric accuracy under significant viewpoint changes.

Figure 2: Comparison with different 3D inpainting approaches. Existing methods such as SPIn-NeRF [36] and GScream [61], designed for forward-facing scenes, perform poorly in 360° scenarios. Reference-based methods like InFusion [29] struggle with accurate depth projection, causing fine-tuning artifacts. Gaussian Grouping [67] frequently misidentifies unseen regions, reducing inpainting quality. Our AuraFusion360 achieves precise unseen masks and improved depth alignment via Adaptive Guided Depth Diffusion, employing SDEdit [32] for diffusion-guided, multi-view consistent RGB generation.

Several critical challenges in 360° unbounded scene inpainting motivated our approach (Fig. 2). Existing methods [36, 61, 35, 37], effective for forward-facing scenes, struggle with extreme viewpoint changes in 360° scenes, resulting in inconsistencies and artifacts. Recent approaches like Gaussian Grouping [67] effectively propagate semantic information for object removal, but their reliance on a text-based tracker [8] often causes misidentified unseen regions, leading to inaccurate reconstructions.

To address these challenges, we propose a unified pipeline for 360° unbounded scene inpainting using Gaussian Splatting for object removal, depth-aware unseen region detection, and multi-view consistent inpainting. Inspired by Gaussian Grouping [67], our method integrates object-masked attributes into Gaussians for precise removal and reconstructs unseen regions before applying reference-guided inpainting. Unlike methods that directly apply inpainters, causing inconsistencies, we develop Adaptive Guided Depth Diffusion (AGDD) to unproject aligned points from the reference view into unseen regions. These points (1) initialize Gaussians and (2) guide inpainted RGB generation via SDEdit [32], ensuring coherent, high-quality 360° scene restoration.

Integrating these improvements, our framework achieves enhanced geometric accuracy and realism in 360° unbounded scenes. To advance 3D inpainting, we propose a method that improves consistency and provides a benchmark for future research. Our contributions include:

  • A depth-aware method leveraging multi-view information to accurately generate unseen masks for 360° unbounded scene inpainting.

  • Integration of reference view unprojection with SDEdit to produce consistent RGB guidance across views.

  • A comprehensive framework with a new 360° dataset and capture protocol, supporting high-quality novel view synthesis and quantitative evaluation.

2 Related Work

NeRF. Neural Radiance Fields (NeRF) [34] revolutionized novel view synthesis via differentiable volume rendering [56, 15] and positional encoding [57, 13]. NeRF models improved in efficiency [27, 12, 7], rendering quality [2, 73, 33], handling dynamic scenes [28], and data efficiency [69, 60, 23, 53]. Despite excelling at view synthesis, NeRF’s implicit representation complicates scene editing. Recent work on object manipulation [65], stylization [58, 14], and inpainting [25, 36, 35] struggles with 3D consistency and structural priors, especially in unbounded scenes.

3D Gaussian Splatting. 3D Gaussian Splatting (3DGS) [20] efficiently represents scenes with explicit 3D Gaussians, enabling faster rendering, easier training, and flexible editing [6]. Recent extensions like Scaffold-GS [30] enhance efficiency with dynamic anchors, while 2DGS [17] refines multi-view geometry. 3DGS has also expanded to dynamic scenes [66, 31, 64, 11] and semantic representations [67, 43], supporting advanced editing and novel view synthesis [44, 17]. Gaussian-based methods thus offer strong potential for explicit 3D inpainting.

Traditional and learning-based image inpainting. Early image inpainting techniques, including PDE-based [4], exemplar-based [9], and PatchMatch [1], were effective for small regions but struggled with complex textures and large gaps [18, 24]. Deep learning advanced the field significantly, starting with Context Encoders [40] and GAN-based methods like DeepFill [71, 72], improving content synthesis and coherence. Recent models such as LaMa [54] use Fourier convolutional networks to address large masks. Diffusion models [16], notably Stable Diffusion [46], introduced iterative refinement capabilities, providing more flexible and structurally consistent inpainting compared to GANs [10].

Diffusion models for image editing and inpainting. Beyond direct inpainting, diffusion models are widely used for image editing. SDEdit [32] injects Gaussian noise and iteratively denoises, enabling semantic edits while preserving global structure. Noise inversion techniques [39, 38], such as DDIM Inversion [52], further improve editing fidelity by enabling precise latent inference through deterministic reverse diffusion. Inpainting-specific diffusion models like SDXL-Inpainting [41] enhance image reconstruction by fine-tuning Stable Diffusion. Reference-based methods [55], such as LeftRefill [5], use diffusion models for reference-guided synthesis but struggle in regions distant from reference views. Despite advancements, Stable Diffusion-based inpainting [42] still suffers from inconsistent artifacts in scene-dependent contexts, causing multi-view inconsistencies problematic for 3D scenes [21]. This motivates our use of SDEdit and DDIM Inversion to preserve structural information and ensure multi-view coherence.

3D scene inpainting. Existing 3D inpainting methods for NeRF [63, 36, 51, 68, 22] typically adapt 2D models to NeRF’s implicit representation. For instance, SPIn-NeRF [36] employs perceptual loss to improve multi-view consistency. Reference-based methods [35, 37, 61] enhance consistency using reference images but remain limited to small-angle view rendering, restricting their use in 360° scenes. NeRFiller [62] iteratively refines consistency with a grid prior but struggles with fine-grained textures due to image downsampling. InNeRF360 [59] handles 360° scenes via density hallucination but has limited scene utilization. Gaussian Splatting-based methods like Gaussian Grouping [67] inject semantic information, while InFusion [29] employs depth completion but requires manual view selection. GScream [61] integrates Scaffold-GS [30] but faces difficulties in unbounded 360° scenes. Our method addresses these issues by enhancing multi-view consistency and depth-aware inpainting in 360° scenarios using Gaussian Splatting.

Figure 3: Overview of our method. Our approach takes multi-view RGB images and corresponding object masks as input and outputs a Gaussian representation with the masked objects removed. The pipeline consists of three main stages: (a) Depth-Aware Unseen Masks Generation to identify truly occluded areas, referred to as the “unseen region”, (b) Depth-Aligned Gaussian Initialization on Reference View to fill unseen regions with initialized Gaussian containing reference RGB information after object removal, and (c) SDEdit-Based RGB Guidance for Detail Enhancement, which enhances fine details using an inpainting model while preserving reference view information. Instead of applying SDEdit with random noise, we use DDIM Inversion on the rendered initial Gaussians to generate noise that retains the structure of the reference view, ensuring multi-view consistency across all RGB Guidance.

3 Method

Our method processes multi-view RGB images $\{I_n\}$ and object masks $\{M_n\}$, $n \in [1..N]$, to produce an inpainted Gaussian representation with removed objects. Occluded regions (unseen regions [67]) are consistently inpainted across views. As shown in Fig. 3, the process includes training a masked Gaussian using object masks, removing objects, and applying (a) Depth-Aware Unseen Mask Generation (Sec. 3.1), (b) Reference View Initial Gaussians Alignment (Sec. 3.2), and (c) SDEdit for Detail Enhancement (Sec. 3.3). This pipeline ensures consistent texture propagation in unbounded scenes, achieving high-quality 3D inpainting.

3.1 Depth-Aware Unseen Mask Generation

Accurate identification of inpainting regions is critical for scene consistency and optimal use of background information. To generate the unseen mask for a view, it is necessary to differentiate between (1) the background visible across multiple views and (2) the unseen region occluded in all views, requiring inpainting.

A naive approach to detecting unseen masks with SAM2 [45] involves manually selecting the first view and propagating prompts across other views. However, SAM2 struggles to consistently detect unseen regions without refinement, often revealing parts of the background or inside objects. To address this, our method employs depth warping to generate bounding box prompts for each view (Fig. 4), ensuring accurate, fully automated unseen region detection.

Figure 4: Overview of the Unseen Mask Generation Process using Depth Warping. To obtain the unseen mask for view $n$, we calculate the pixel correspondences between view $n$ and all other views $i$ by using the rendered incomplete depth $D_n^{\text{incomplete}}$. For each view $i$, the removal region $R_i$ is traversed backward to view $n$ to align occlusions. We then aggregate the results from multiple views, averaging and applying a threshold to produce the initial contour of the unseen mask. This contour is subsequently converted into a bounding box prompt for SAM2 [45], which refines the unseen mask to its final version for view $n$.

Depth warping for generating bbox prompt to SAM2. To refine the unseen mask, we employ a depth-warping technique, as illustrated in Fig. 4. For each view $n$, we compute:

$$R_{i \rightarrow n} = \mathcal{W}_{\text{traverse}}\left(R_i, D_n^{\text{incomplete}}, T_{n \rightarrow i}\right), \tag{1}$$

where $\mathcal{W}_{\text{traverse}}$ includes forward warping from view $n$ to $i$ and backward traversal to map the removal region back to $n$. $R_i$ is the removal region mask for view $i$, derived from depth differences. $D_n^{\text{incomplete}}$ is the incomplete depth map for view $n$, and $T_{n \rightarrow i}$ is the transformation from view $n$ to $i$.
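One possible reading of the traversal in Eq. (1) can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `traverse_removal_region`, the shared pinhole intrinsics `K`, the 4×4 rigid transform `T_n_to_i`, equal image resolution in both views, and the nearest-neighbour lookup are all hypothetical choices, and no occlusion check is performed in view $i$.

```python
import numpy as np

def traverse_removal_region(R_i, D_n, K, T_n_to_i):
    """For each pixel of view n, look up the removal mask R_i at its
    corresponding pixel in view i (nearest neighbour) and return the
    result on view n's pixel grid. Illustrative sketch only."""
    H, W = D_n.shape
    v, u = np.mgrid[0:H, 0:W]                 # pixel rows (v) and columns (u)
    valid = D_n > 0                           # pixels that have rendered depth
    z = np.where(valid, D_n, 1.0)             # placeholder depth for invalid pixels
    # Unproject view-n pixels to 3D points in view n's camera frame.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_n = np.stack([x, y, z, np.ones_like(z)], axis=-1)   # (H, W, 4)
    # Forward warp: express the points in view i's camera frame.
    pts_i = pts_n @ T_n_to_i.T
    zi = pts_i[..., 2]
    in_front = zi > 1e-8
    zi_safe = np.where(in_front, zi, 1.0)
    ui = np.rint(K[0, 0] * pts_i[..., 0] / zi_safe + K[0, 2]).astype(int)
    vi = np.rint(K[1, 1] * pts_i[..., 1] / zi_safe + K[1, 2]).astype(int)
    inside = valid & in_front & (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)
    # Backward traversal: carry R_i's value back to the originating pixel in view n.
    R_i_to_n = np.zeros((H, W), dtype=bool)
    R_i_to_n[inside] = R_i[vi[inside], ui[inside]].astype(bool)
    return R_i_to_n
```

With an identity transform the function returns $R_i$ unchanged wherever view $n$ has valid depth, which is a convenient sanity check before plugging in real camera poses.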

The unseen mask contour for view $n$ is obtained by aggregating warped removal regions and applying thresholding:

$$C_n = \theta\left(\frac{1}{K}\sum_{i=1}^{K} R_{i \rightarrow n}\right) \cap R_n, \tag{2}$$

where $C_n$ is the contour of the unseen mask, $K$ is the number of views, and $\theta$ is a thresholding function. A bounding box $\text{bbox}(C_n)$ is created as a prompt for SAM2 [45] to generate the final unseen mask:

$$U_n = \text{SAM2}(\text{bbox}(C_n)). \tag{3}$$

This mask $U_n$ guides the inpainting process, focusing on areas needing reconstruction while preserving original scene information.
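The aggregation and bounding-box steps of Eqs. (2) and (3) can be sketched as below. This is a minimal sketch under assumptions: the helper names `unseen_contour` and `bbox` are hypothetical, the thresholding function is taken to be a simple "greater than or equal" comparison, and the default of 0.6 is the aggregation threshold reported in Sec. 3.4. The SAM2 call itself is omitted, since no particular SAM2 interface is assumed here.

```python
import numpy as np

def unseen_contour(warped_regions, R_n, theta=0.6):
    """Eq. (2): average the K warped removal masks, threshold the mean,
    and intersect with view n's own removal region R_n.
    The >= comparison is an assumed form of the thresholding function."""
    mean_vote = np.mean(np.stack(warped_regions).astype(float), axis=0)
    return (mean_vote >= theta) & R_n.astype(bool)

def bbox(mask):
    """Tight box (x_min, y_min, x_max, y_max) around a binary mask,
    or None when the mask is empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

Pixels that only a minority of views flag as part of the removal region fall below the threshold and are dropped, so the resulting box tends to cover only the area that stays hidden from most viewpoints.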

3.2 Reference View Initial Gaussians Alignment

After performing object removal and generating the unseen mask, similar to CorrFill [26], we select a reference view $V_{\text{ref}}$, which can render an incomplete RGB image and depth. We then apply RGB inpainting to the incomplete RGB image of $V_{\text{ref}}$ and denote the result as $I_{\text{ref}}$. To maximize cross-view consistency, we project the reference RGB image into 3D space using the depth estimate of $I_{\text{ref}}$, which is obtained through Adaptive Guided Depth Diffusion. This 3D projection serves two critical purposes: it guides the SDEdit-based RGB detail enhancement and initializes point positions for Gaussian fine-tuning. Accurate depth alignment is, therefore, fundamental to our pipeline, as it directly determines the precision of these initial point positions.

Adaptive Guided Depth Diffusion (AGDD). Aligning estimated depth with existing depth is challenging due to the scale ambiguity and non-metric representation of monocular depth estimation [19] across coordinate systems. This challenge intensifies in 360° unbounded scenes, where large viewpoint changes hinder alignment. Traditional scale-shift optimization often yields suboptimal results, while depth-completion models demand costly fine-tuning. Our AGDD refines GDD [70] by addressing over-alignment issues, particularly where depth transitions from small to large values, which exaggerates disparities in distant regions and inflates loss values. To mitigate this, we introduce an adaptive loss $\mathcal{L}_{\text{adaptive}}$ that balances alignment, preventing distant regions from dominating and yielding more accurate depth estimates.

The framework is shown in Fig. 5. Following the standard denoising process of Marigold [19], we initialize with a latent representation perturbed by full-strength Gaussian noise, denoted as $d_t$, and generate the aligned depth $D_{\text{aligned}} = \text{Decoder}(d_0)$ using a VAE decoder, where the latent $d_0$ is obtained by the recursive denoising step $d_{t-1} = \text{Denoise}(d_t, t, \hat{\epsilon}_t)$. The noise $\hat{\epsilon}_t$ is derived by updating the original noise through the calculation of the adaptive loss $\mathcal{L}_{\text{adaptive}}$ between the pre-decoded estimated depth $D_{t-1}$ and the existing incomplete depth $D_{\text{incomplete}}$. Note that $D_{t-1}$ is obtained by decoding $d_0'$, which is the model’s estimate of the fully denoised latent at timestep $0$ when predicted from the noisy state at timestep $t-1$. This adaptive loss refines $\hat{\epsilon}_t$ to ensure that the estimated depth aligns with the existing incomplete depth during denoising. The optimization process is described as follows:

$$d_{t-1} = \text{Denoise}(d_t, t, \hat{\epsilon}_t), \tag{4}$$
$$\hat{\epsilon}_t = \text{UNet}(d_t, I_{\text{scene}}, t) - \alpha \cdot \nabla \mathcal{L}_{\text{adaptive}}, \tag{5}$$

where $\alpha$ is the learning rate for the optimization. We define a bounding box $\mathcal{B}$ around the unseen region and introduce a threshold $\delta$ to downweight errors for distant points. The adaptive loss $\mathcal{L}_{\text{adaptive}}$ between the pre-decoded estimated depth $D_{t-1}$ and the incomplete depth $D_{\text{incomplete}}$ is computed as follows:

$$M_{\text{guide}}(x, y) = \begin{cases} 1 & \text{if } (x, y) \in \mathcal{B} \setminus U \\ 0 & \text{otherwise}, \end{cases} \tag{6}$$
$$\mathcal{L}_{\text{adaptive}} = \sum_{(x, y)} M_{\text{guide}}(x, y) \cdot \mathcal{L}(D_{t-1}, D_{\text{incomplete}})(x, y), \tag{7}$$
$$\mathcal{L}(d_1, d_2) = \begin{cases} \frac{1}{2}(d_1 - d_2)^2 & \text{if } |d_1 - d_2| < \delta \\ \delta \cdot |d_1 - d_2| - \frac{1}{2}\delta^2 & \text{otherwise,} \end{cases} \tag{8}$$

where $M_{\text{guide}}(x, y)$ is a mask function indicating whether a pixel $(x, y)$ is within the bounding box $\mathcal{B}$ but not in the unseen mask $U$. At each denoising step, we update the noise over $N$ iterations. Instead of directly optimizing the noise using L2 loss [70], this loss ensures that the updated noise input to the denoiser enables it to generate an estimated depth that aligns with the incomplete guided depth. This enables the AGDD output to achieve accurate alignment in regions adjacent to unseen areas, which is more appropriate for depth inpainting scenarios while also operating in a zero-shot manner.
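To make Eqs. (6) to (8) concrete, the following NumPy sketch evaluates the masked Huber-style penalty on plain depth arrays. It only illustrates the loss value, under assumptions: the function names, the inclusive `(x_min, y_min, x_max, y_max)` box convention, and the use of already-decoded depth maps as inputs are hypothetical, and the gradient step of Eq. (5) through the diffusion model is not reproduced.

```python
import numpy as np

def guide_mask(unseen, box):
    """Eq. (6): True inside the bounding box B but outside the unseen mask U."""
    x0, y0, x1, y1 = box                      # inclusive pixel bounds (assumed convention)
    in_box = np.zeros(unseen.shape, dtype=bool)
    in_box[y0:y1 + 1, x0:x1 + 1] = True
    return in_box & ~unseen.astype(bool)

def huber(d1, d2, delta):
    """Eq. (8): quadratic for residuals below delta, linear beyond it."""
    r = np.abs(d1 - d2)
    return np.where(r < delta, 0.5 * r**2, delta * r - 0.5 * delta**2)

def adaptive_loss(D_est, D_incomplete, unseen, box, delta):
    """Eq. (7): per-pixel penalty summed over the guided region only."""
    m = guide_mask(unseen, box)
    return float(np.sum(m * huber(D_est, D_incomplete, delta)))
```

Because the penalty grows only linearly once the residual exceeds $\delta$, a few far-away pixels with large depth differences contribute less than they would under a pure L2 loss, which matches the motivation given above for preventing distant regions from dominating.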

Figure 5: Overview of Adaptive Guided Depth Diffusion (AGDD). The framework takes image latent, incomplete depth, and unseen mask as inputs to generate aligned depth estimates. (a) The guided region is identified by dilating the unseen mask and subtracting the original mask. (b) At each timestep $t$, the adaptive loss $\mathcal{L}_{\text{adaptive}}$ is computed between the pre-decoded and incomplete depth to update the noise input $\hat{\epsilon}_t$. This repeats $N$ times before advancing to the next denoising step, ensuring the estimated depth aligns with the incomplete depth distribution in the guided region.

Initializing Gaussians in unseen regions. With the aligned depth $D_{\text{aligned}}^{\text{ref}}$ of the reference view, we proceed to initialize new Gaussians in the unseen regions. First, we unproject the inpainted RGB of the reference view with $D_{\text{aligned}}^{\text{ref}}$ to 3D space, focusing on the unseen regions identified by the unseen mask. This unprojection takes into account the camera’s intrinsic parameters. For each pixel $(u, v)$ in the unseen region where $U_{\text{final}}(u, v) = 1$, we compute the 3D point $P = (X, Y, Z)$ as $Z = D_{\text{aligned}}^{\text{ref}}(u, v)$, $X = (u - c_x) \cdot Z / f_x$, $Y = (v - c_y) \cdot Z / f_y$, where $(f_x, f_y)$ are the focal lengths in pixels and $(c_x, c_y)$ are the principal point offsets. This process gives us a set of initial 3D points $P$. These points are then used to initialize new Gaussians in the unseen regions, inheriting color from the reference view. Existing background Gaussians, unaffected by object removal, remain fixed during initialization and optimization. These initialized Gaussians are crucial for the subsequent process of generating guided inpainted RGB images and optimization.
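The unprojection formulas above translate almost directly into array code. The sketch below is a minimal illustration with a hypothetical function name and argument layout; it returns points in the reference camera's coordinate frame, and mapping them into world space would additionally require the camera pose, which is an assumption beyond what the text specifies.

```python
import numpy as np

def unproject_unseen(depth, rgb, unseen, fx, fy, cx, cy):
    """Lift unseen-region pixels to camera-space 3D points with their colours."""
    v, u = np.nonzero(unseen)                 # rows are v, columns are u
    Z = depth[v, u]
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    points = np.stack([X, Y, Z], axis=1)      # shape (M, 3)
    colors = rgb[v, u]                        # shape (M, 3), taken from the reference view
    return points, colors
```

Each returned point would seed one new Gaussian in the unseen region, with its colour inherited from the inpainted reference image as described above.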

3.3 SDEdit for Detail Enhancement

After initializing Gaussians in unseen regions, we aim to obtain the inpainted RGB guidance with fine details while ensuring multi-view consistency, which further refines our initial Gaussians during fine-tuning. Inspired by SDEdit [32], we refine the rendered initial Gaussians by adding scaled noise proportional to a strength factor $s$, ensuring that the inpainting model retains structural information from the reference view while allowing for detail refinement across multiple perspectives. We further find that instead of injecting random Gaussian noise, applying DDIM Inversion [52] to the rendered initial Gaussians better preserves their structural information during the denoising process. This approach allows the diffusion inpainting model to reconstruct missing details while maintaining alignment with the reference view, ensuring that inpainted regions integrate seamlessly into the scene (see Fig. 11).

Specifically, given a rendered training view $I_{\text{init}}$, we first obtain its corresponding noise representation via DDIM Inversion, capturing the essential structure of the reference view in the latent space. Instead of inverting fully to $t_0$, we compute an intermediate timestep $t_{\text{inv}}$ based on the noise strength $s$:

$$t_{\text{inv}} = T(1 - s), \tag{9}$$

where $T$ is the total number of timesteps in the diffusion process, and $s$ controls the noise strength. We then perform DDIM Inversion to obtain the noise representation at $t_{\text{inv}}$:

$$\epsilon_{\text{inv}} = \text{DDIM-Invert}(I_{\text{init}}, t_{\text{inv}}). \tag{10}$$

Next, we denoise this noise using a 2D diffusion inpainting model, conditioned on the reference view $I_{\text{ref}}$, ensuring that the reconstructed details align with the global scene while maintaining consistency across views:

$$I_{\text{guided}} = \text{Denoise}(\epsilon_{\text{inv}}, \text{condition} = I_{\text{ref}}, t_{\text{inv}} \rightarrow 0). \tag{11}$$

By inverting to a noise level corresponding to strength $s$, this step ensures that the inpainting model refines details while maintaining geometric consistency with the reference view. Unlike traditional SDEdit, which applies random noise addition before denoising, our approach leverages DDIM Inversion to obtain structured noise that aligns with the scene, preventing hallucinated details that could disrupt multi-view coherence.
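As a small worked example of Eq. (9), the helper below computes the intermediate timestep from the strength. The function name, the rounding to an integer step, and the schedule length of 1000 used in the usage line are illustrative assumptions; the text does not state the value of $T$.

```python
def inversion_timestep(total_steps: int, strength: float) -> int:
    """Eq. (9): t_inv = T * (1 - s), rounded to the nearest integer step.
    Rounding is an assumption; the paper gives only the real-valued formula."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    return round(total_steps * (1.0 - strength))

# Hypothetical schedule length T = 1000 with the reported strength s = 0.85.
t_inv = inversion_timestep(1000, 0.85)
```

Under these assumed numbers the formula yields step 150 of the 1000-step schedule.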

The resulting guided inpainted RGBs are then used as supervision for Gaussian fine-tuning, updating only the unprojected Gaussians from Sec. 3.2. The final reconstruction is optimized using a combination of L1, SSIM, and LPIPS [74] losses:

$$\mathcal{L} = (1 - \lambda_{\text{SSIM}})\mathcal{L}_1 + \lambda_{\text{SSIM}}\mathcal{L}_{\text{SSIM}} + \lambda_{\text{LPIPS}}\mathcal{L}_{\text{LPIPS}}. \tag{12}$$
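The weighting in Eq. (12) can be illustrated with scalar loss terms. This is a sketch only: the function name is hypothetical, the three inputs are assumed to be precomputed per-image loss values (how the SSIM term is converted into a loss is not specified here), and the defaults follow the values listed in Sec. 3.4 (0.8 and 0.5), taking the latter as the LPIPS weight.

```python
def finetune_loss(l1: float, ssim_term: float, lpips: float,
                  lambda_ssim: float = 0.8, lambda_lpips: float = 0.5) -> float:
    """Eq. (12): (1 - lambda_SSIM) * L1 + lambda_SSIM * L_SSIM + lambda_LPIPS * L_LPIPS."""
    return (1.0 - lambda_ssim) * l1 + lambda_ssim * ssim_term + lambda_lpips * lpips
```

With these defaults the L1 term carries a weight of only 0.2, so the structural and perceptual terms dominate the fine-tuning objective.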

3.4 Implementation Details

We use the 2D Gaussian Splatting [17] codebase for Gaussian representation to obtain accurate rendered depth, with SAM2 generating object masks on the first frame for each training view. Masked Gaussians enable effective object removal due to their explicit representation. We set the aggregation threshold $\theta$ to 0.6 in unseen mask generation. In AGDD, the incomplete depth is normalized to match Marigold's [19] depth. With $N$ set to 8, the denoised result is then unnormalized back to its original scale. The entire inference process takes approximately 1 minute on an RTX 4090 GPU. The SDEdit noise strength is set to $s=0.85$, which balances detail refinement against retention of the initial points, as shown in our ablation study. We condition the generation on the reference view using LeftRefill [5]. During Gaussian fine-tuning, we run 10,000 iterations with $\lambda_{\text{SSIM}}=0.8$ and $\lambda_{\text{LPIPS}}=0.5$.
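The normalize-then-unnormalize round trip mentioned for AGDD can be sketched as follows. This is an illustration only: the section does not state which normalization is used, so the min-max mapping over valid pixels, the function names, and the `valid` mask argument are assumptions.

```python
import numpy as np


def normalize_depth(depth, valid):
    """Affinely map depths to [0, 1] using the range of the valid pixels (assumed scheme)."""
    d_min = float(depth[valid].min())
    d_max = float(depth[valid].max())
    scale = max(d_max - d_min, 1e-8)      # guard against a constant depth map
    return (depth - d_min) / scale, (d_min, scale)


def unnormalize_depth(depth_norm, params):
    """Invert normalize_depth to recover depths in the original scene scale."""
    d_min, scale = params
    return depth_norm * scale + d_min
```

Keeping the `(d_min, scale)` pair from the incomplete depth is what allows a completed depth map produced in the normalized range to be placed back in the scene's metric frame.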

Figure 6: Overview of the 360-USID dataset. Sample images from each scene, including five outdoor scenes (Carton, Cone, Newcone, Skateboard, Plant) and two indoor scenes (Cookie, Sunflower). (Bottom right) The table shows statistics for each scene, including the number of training views and ground truth (GT) novel views. The dataset provides a diverse range of environments for evaluating 3D inpainting methods in both indoor and outdoor settings.
Figure 7: Illustration of the data capture process for the 360-USID dataset. (a) Capturing training views: Multiple images are taken around the object in the scene. (b) Capturing the reference view: A camera is mounted on a tripod to capture a fixed reference view (with an object). (c) Capturing novel views: After removing the object, additional images are taken from various viewpoints, including one from the same tripod position as the reference image.

4 360° Unbounded Scenes Inpainting Dataset

To address the lack of reference-based 360° inpainting datasets, we introduce the 360° Unbounded Scenes Inpainting Dataset (360-USID), consisting of seven scenes with training views (RGB images and object masks), novel testing views (inpainting ground truth), and a reference view (without objects) for evaluating with other reference-based methods.

Dataset collection protocol. We developed a protocol using a standard camera to create this dataset, as simultaneously capturing multi-view photos with and without objects typically requires specialized equipment. Our protocol, illustrated in Fig. 7, consists of:

  1. Positioning an object (e.g., a vase) on a textured surface within a 360° unbounded scene. Training views are captured in two complete circular trajectories around the object: the first focuses primarily on the object, while the second maximizes background coverage to ensure comprehensive scene capture.

  2. Securing the camera on a tripod and capturing a reference view from a fixed position and orientation.

  3. After object removal, capturing novel views from both the fixed tripod position and additional positions distinct from training trajectories for ground truth evaluation.

To ensure high-quality captures, we record video at 4K 60fps with stabilized camera settings and extract the sharpest frames using the variance of the Laplacian method. Each scene comprises 180 to 200 training views and approximately 30 testing views for quantitative evaluations. Consistent lighting is maintained throughout to minimize shadow variations between reference and testing images.
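The variance-of-the-Laplacian criterion can be sketched in a few lines of NumPy. This is a generic illustration of the technique rather than the capture script used for the dataset; the 4-neighbour kernel, the function names, and the top-`k` selection are assumptions.

```python
import numpy as np


def laplacian_variance(gray):
    """Sharpness score: variance of a 4-neighbour Laplacian over interior pixels."""
    g = np.asarray(gray, dtype=np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())


def sharpest_frames(frames, k):
    """Return the indices of the k frames with the highest score, best first."""
    scores = [laplacian_variance(f) for f in frames]
    order = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

Blurry frames have weak second derivatives and therefore a low score. With OpenCV, a comparable score is commonly computed as `cv2.Laplacian(gray, cv2.CV_64F).var()`, which differs from this sketch only in border handling.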

Data preprocessing and pose estimation. Our processing pipeline begins with using COLMAP [49, 50] or similar SfM pipelines like hloc [47, 48] to compute a shared 3D coordinate space for both training and novel views. We then generate object masks for training views using SAM2 [45] and mask out object regions in COLMAP reconstruction. After obtaining camera poses, we process the training images with NeRF/3DGS inpainting methods and render novel views for comparison against ground truth. Finally, we refine testing views by training a masked-3DGS model and selecting optimal frames based on PSNR scores computed outside object regions, yielding approximately 30 high-quality test views per scene. The resulting dataset provides a comprehensive benchmark for evaluating 360° inpainting methods across diverse scenes and viewpoints, with particular attention to view consistency and geometric accuracy.

Scene descriptions. Our 360-USID dataset, shown in Fig. 6, contains seven diverse scenes: five outdoor (Carton, Cone, Newcone, Plant, Skateboard) and two indoor (Cookie, Sunflower). Each scene includes 180-200 training images at 3840×2160 resolution (Plant at 1920×1440), 30 ground truth testing images, and one reference image without objects. Scenes are downscaled to 960×540 for evaluation, providing a comprehensive benchmark for testing 3D inpainting methods across varied real-world environments.

5 Experiments

5.1 Experimental setup

Datasets. We evaluate on two 360° unbounded scene datasets: (1) 360-USID (Ours): a new dataset of 7 scenes (2 indoor, 5 outdoor) for evaluating 360° inpainting, with 180-200 training views containing objects, around 30 test views without objects, and 1 reference view per scene. All images are processed at 960px width to preserve details for quantitative evaluation. (2) Other-360 [3]: we collect 6 additional standard 360° unbounded scenes from NeRF [34], MipNeRF-360 [3], and Instruct-NeRF2NeRF [14] for qualitative evaluation at 1/4 resolution, with frame 0 as the reference for all methods.

Metrics. We evaluate our method using two complementary metrics: LPIPS (Learned Perceptual Image Patch Similarity) [74] for perceptual quality and PSNR (Peak Signal-to-Noise Ratio) for reconstruction accuracy. Following SPIn-NeRF [36], we compute these metrics only within object masks to focus on inpainting quality. While both metrics are used for 360-USID, which has ground truth, only qualitative assessment is possible for Other-360. Additional evaluation results are provided in supplementary materials.
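Restricting a metric to the object mask can be illustrated with PSNR. The following is a minimal sketch under stated assumptions: images are float arrays in `[0, max_val]`, the mask is a boolean H×W array, and the function name is hypothetical. Masked LPIPS requires a learned network and is not covered here.

```python
import numpy as np


def masked_psnr(pred, gt, mask, max_val=1.0):
    """PSNR computed only over pixels where mask is True."""
    mask = np.asarray(mask, dtype=bool)
    diff = (np.asarray(pred, dtype=np.float64) - np.asarray(gt, dtype=np.float64))[mask]
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")               # identical inside the mask
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Errors outside the mask do not affect the score, so the metric reflects the inpainted content rather than how well the untouched background is reconstructed. Inverting the mask gives the complementary score computed outside object regions, as used for test-frame selection in Sec. 4.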

Table 1: Quantitative comparison of 360° inpainting methods on the 360-USID dataset. Red text indicates the best, and blue text indicates the second-best performing method.

| Method (PSNR ↑ / LPIPS ↓) | Carton | Cone | Cookie | Newcone | Plant | Skateboard | Sunflower | Average |
|---|---|---|---|---|---|---|---|---|
| SPIn-NeRF [36] | 16.659 / 0.539 | 15.438 / 0.389 | 11.879 / 0.521 | 17.131 / 0.519 | 16.850 / 0.401 | 15.645 / 0.675 | 23.538 / 0.206 | 16.734 / 0.464 |
| 2DGS [17] + LaMa [54] | 16.433 / 0.499 | 15.591 / 0.351 | 11.711 / 0.538 | 16.598 / 0.670 | 14.491 / 0.564 | 15.520 / 0.639 | 23.024 / 0.194 | 16.195 / 0.494 |
| 2DGS [17] + LeftRefill [5] | 15.157 / 0.567 | 16.143 / 0.372 | 12.458 / 0.526 | 16.717 / 0.677 | 12.856 / 0.666 | 16.429 / 0.634 | 24.216 / 0.181 | 16.282 / 0.518 |
| LeftRefill [5] | 14.667 / 0.560 | 14.933 / 0.380 | 11.148 / 0.519 | 16.264 / 0.448 | 16.183 / 0.463 | 14.912 / 0.572 | 18.851 / 0.331 | 15.280 / 0.468 |
| Gaussian Grouping [67] | 16.695 / 0.502 | 14.549 / 0.366 | 11.564 / 0.731 | 16.745 / 0.533 | 16.175 / 0.440 | 16.002 / 0.577 | 20.787 / 0.209 | 16.074 / 0.480 |
| GScream [61] | 14.609 / 0.587 | 14.655 / 0.476 | 12.733 / 0.429 | 13.662 / 0.605 | 16.238 / 0.437 | 12.941 / 0.626 | 18.470 / 0.436 | 14.758 / 0.514 |
| Infusion [29] | 14.191 / 0.555 | 14.163 / 0.439 | 12.051 / 0.486 | 9.562 / 0.624 | 16.127 / 0.406 | 13.624 / 0.638 | 21.195 / 0.238 | 14.416 / 0.484 |
| AuraFusion360 (Ours) w/o SDEdit | 13.731 / 0.477 | 14.260 / 0.390 | 12.332 / 0.445 | 16.646 / 0.460 | 17.609 / 0.319 | 15.107 / 0.580 | 24.884 / 0.170 | 16.367 / 0.406 |
| AuraFusion360 (Ours) | 17.675 / 0.473 | 15.626 / 0.332 | 12.841 / 0.434 | 17.536 / 0.426 | 18.001 / 0.322 | 17.007 / 0.559 | 24.943 / 0.173 | 17.661 / 0.388 |
Figure 8: Visual Comparison on our 360-USID dataset. We compare our method against state-of-the-art approaches including Gaussian Grouping [67], 2DGS + LeftRefill, and Infusion [29]. While Gaussian Grouping struggles with misidentifying unseen regions, leading to floating artifacts, and 2DGS + LeftRefill faces view consistency issues, our method successfully maintains geometric consistency and preserves fine details across different viewpoints. Ground truth (GT) is shown for reference, and the original scene with an object is provided in the first row for comparison.

5.2 Comparisons with State-of-the-Art Methods

Quantitative comparisons. We evaluate AuraFusion360 against state-of-the-art approaches on the 360-USID dataset. Tab. 1 shows PSNR and LPIPS scores across different scenes. Our method outperforms existing approaches on average and on most individual scenes. SPIn-NeRF [36] (which we re-implement on the 2D Gaussian Splatting codebase to extend it to 360° unbounded scenes) and Infusion [29] struggle with 360° consistency, while Gaussian Grouping [67] misidentifies the unseen region, causing significant floating artifacts. GScream [61] fails to properly remove objects, and LeftRefill [5] improves but still falls short in 360° environments. 2DGS + LaMa [54] and 2DGS + LeftRefill outperform 2D methods but face view consistency challenges. Our method achieves the highest average PSNR and the lowest average LPIPS, indicating superior perceptual quality and better similarity to the ground truth. The performance gap is especially noticeable in scenes with complex geometry or large removed objects, demonstrating our method's ability to leverage multi-view information and maintain 360° consistency. The code for InNeRF360 [59] could not be successfully executed, and [35] did not provide code, so we were unable to compare our method with theirs.

Qualitative visual comparisons. Fig. 8 compares our AuraFusion360 method against state-of-the-art approaches on challenging scenes from the 360-USID dataset. Our method excels in maintaining view consistency and preserving fine details in 360° unbounded environments. Additional qualitative results on other 360 datasets and failure cases are provided in the supplementary material.

Table 2: Ablation study of our AuraFusion360.

| Depth init. (Sec. 3.2) | SDEdit strength (Sec. 3.3) | PSNR ↑ | LPIPS ↓ |
|---|---|---|---|
| ✗ | 0.85 | 16.638 | 0.456 |
| ✓ | 0.5 | 17.646 | 0.393 |
| ✓ | 1.0 | 17.512 | 0.391 |
| ✓ | 0.85 | 17.661 | 0.388 |

5.3 Ablation Studies

To evaluate the effectiveness of each component in our AuraFusion360 method, we conduct a series of ablation studies. Tab. 2 presents the quantitative results of these studies.

Unseen mask generation. We compare our unseen mask generation method with SAM2 [45] and the Gaussian Grouping [67] tracker in Fig. 9 and Fig. 10. Our approach significantly improves inpainting quality, particularly in areas occluded from multiple views. The unseen masks identify truly occluded regions, leading to more accurate and consistent inpainting results. This is especially noticeable in scenes with complex geometries, where object masks alone may not capture all necessary information for effective inpainting.

Figure 9: Visual comparison of unseen mask generation methods. Our method enables SAM2 [45] to generate more accurate predictions for each view without the need for manually provided prompts, as the bounding box prompts are automatically generated through depth warping.
Figure 10: Compared Unseen Mask w/ Gaussian Grouping. Gaussian Grouping [67] uses a video tracker [8] and the “black blurry hole” prompt for DEVA [8] to track the unseen region. However, this can result in tracking errors, affecting inpainting. In contrast, our geometry-based approach uses depth warping to estimate the unseen region’s contour, reducing segmentation errors.
Figure 11: Comparison with other depth completion methods. The depth completion model in Infusion [29] (a) performs better at depth alignment than the traditional methods (b) and (c), but it produces noisy depth in unseen regions. Similarly, (d) Guided Depth Diffusion [70] struggles to achieve precise alignment, as the background regions amplify the loss, leading to misalignment. In contrast, (e) our AGDD effectively addresses these issues.

Effect of reference view initial Gaussians alignment. Tab. 2 and  Fig. 11 show that our depth-aware 3DGS initialization accurately estimates aligned depth while maintaining geometric consistency in the inpainted regions. Compared to random initialization, our method produces more structurally coherent results, particularly in areas with significant depth variations. This is especially evident in scenes where the inpainted geometry needs to blend seamlessly with the existing scene structure.

6 Conclusion

We presented AuraFusion360, a novel reference-based 360° inpainting method for 3D scenes in unbounded environments. Our approach effectively addresses the challenges of object removal and hole filling in complex 3D scenes. Key contributions include leveraging multi-view information through improved unseen mask generation, integrating reference-guided 3D inpainting with diffusion priors, and introducing the 360-USID dataset for comprehensive evaluation. Experimental results demonstrate AuraFusion360’s superior performance over existing methods, particularly in complex geometries and large view variations. While this work represents a significant advancement in 3D scene editing, future work will focus on computational efficiency, dynamic scenes, and language-guided editing capabilities.

Acknowledgements.

This work was supported by NVIDIA Taiwan AI Research & Development Center (TRDC). This research was funded by the National Science and Technology Council, Taiwan, under Grants NSTC 112-2222-E-A49-004-MY2 and 113-2628-E-A49-023-. Yu-Lun Liu acknowledges the Yushan Young Fellow Program by the MOE in Taiwan.

References

  • Barnes et al. [2009] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM TOG, 2009.
  • Barron et al. [2021] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In ICCV, 2021.
  • Barron et al. [2022] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In CVPR, 2022.
  • Bertalmio [2000] M Bertalmio. Image inpainting, 2000.
  • Cao et al. [2024] Chenjie Cao, Yunuo Cai, Qiaole Dong, Yikai Wang, and Yanwei Fu. Leftrefill: Filling right canvas based on left reference through generalized text-to-image diffusion model. In CVPR, 2024.
  • Chen et al. [2024] Yiwen Chen, Zilong Chen, Chi Zhang, Feng Wang, Xiaofeng Yang, Yikai Wang, Zhongang Cai, Lei Yang, Huaping Liu, and Guosheng Lin. Gaussianeditor: Swift and controllable 3d editing with gaussian splatting. In CVPR, 2024.
  • Cheng et al. [2024] Bo-Yu Cheng, Wei-Chen Chiu, and Yu-Lun Liu. Improving robustness for joint optimization of camera pose and decomposed low-rank tensorial radiance fields. In AAAI, 2024.
  • Cheng et al. [2023] Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, and Joon-Young Lee. Tracking anything with decoupled video segmentation. In ICCV, 2023.
  • Criminisi et al. [2004] Antonio Criminisi, Patrick Pérez, and Kentaro Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE TIP, 2004.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, 2021.
  • Fan et al. [2025] Cheng-De Fan, Chen-Wei Chang, Yi-Ruei Liu, Jie-Ying Lee, Jiun-Long Huang, Yu-Chee Tseng, and Yu-Lun Liu. Spectromotion: Dynamic 3d reconstruction of specular scenes. In CVPR, 2025.
  • Garbin et al. [2021] Stephan J. Garbin, Marek Kowalski, Matthew Johnson, Jamie Shotton, and Julien Valentin. FastNeRF: High-fidelity neural rendering at 200FPS. In ICCV, 2021.
  • Gehring et al. [2017] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In ICML, 2017.
  • Haque et al. [2023] Ayaan Haque, Matthew Tancik, Alexei A. Efros, Aleksander Holynski, and Angjoo Kanazawa. Instruct-nerf2nerf: Editing 3d scenes with instructions. In ICCV, 2023.
  • Henzler et al. [2019] Philipp Henzler, Niloy J. Mitra, and Tobias Ritschel. Escaping Plato’s cave: 3D shape from adversarial rendering. In ICCV, 2019.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  • Huang et al. [2024] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. In ACM SIGGRAPH 2024 Conference Papers, 2024.
  • Jam et al. [2021] Jireh Jam, Connah Kendrick, Kevin Walker, Vincent Drouard, Jison Gee-Sern Hsu, and Moi Hoon Yap. A comprehensive review of past and present image inpainting methods. CVIU, 203:103147, 2021.
  • Ke et al. [2024] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In CVPR, 2024.
  • Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM TOG, 2023.
  • Li et al. [2023] Xin Li, Yulin Ren, Xin Jin, Cuiling Lan, Xingrui Wang, Wenjun Zeng, Xinchao Wang, and Zhibo Chen. Diffusion models for image restoration and enhancement–a comprehensive survey. arXiv preprint arXiv:2308.09388, 2023.
  • Lin et al. [2024] Chieh Hubert Lin, Changil Kim, Jia-Bin Huang, Qinbo Li, Chih-Yao Ma, Johannes Kopf, Ming-Hsuan Yang, and Hung-Yu Tseng. Taming latent diffusion model for neural radiance field inpainting. In ECCV, 2024.
  • Lin et al. [2025] Chin-Yang Lin, Chung-Ho Wu, Chang-Han Yeh, Shih-Han Yen, Cheng Sun, and Yu-Lun Liu. Frugalnerf: Fast convergence for few-shot novel view synthesis without learned priors. In CVPR, 2025.
  • Liu et al. [2018] Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. In ECCV, 2018.
  • Liu et al. [2022] Hao-Kang Liu, I-Chao Shen, and Bing-Yu Chen. NeRF-In: Free-form NeRF inpainting with RGB-D priors. In arXiv, 2022.
  • Liu et al. [2025] Kuan-Hung Liu, Cheng-Kun Yang, Min-Hung Chen, Yu-Lun Liu, and Yen-Yu Lin. Corrfill: Enhancing faithfulness in reference-based inpainting with correspondence guidance in diffusion models. In WACV, 2025.
  • Liu et al. [2020] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In NeurIPS, 2020.
  • Liu et al. [2023] Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang, Johannes Kopf, and Jia-Bin Huang. Robust dynamic radiance fields. In CVPR, 2023.
  • Liu et al. [2024] Zhiheng Liu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jie Xiao, Kai Zhu, Nan Xue, Yu Liu, Yujun Shen, and Yang Cao. Infusion: Inpainting 3d gaussians via learning depth completion from diffusion prior. arXiv preprint arXiv:2404.11613, 2024.
  • Lu et al. [2024] Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. In CVPR, 2024.
  • Luiten et al. [2024] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. In 3DV, 2024.
  • Meng et al. [2022] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022.
  • Meuleman et al. [2023] Andreas Meuleman, Yu-Lun Liu, Chen Gao, Jia-Bin Huang, Changil Kim, Min H Kim, and Johannes Kopf. Progressively optimized local radiance fields for robust view synthesis. In CVPR, 2023.
  • Mildenhall et al. [2020] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  • Mirzaei et al. [2023a] Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, and Igor Gilitschenski. Reference-guided controllable inpainting of neural radiance fields. In ICCV, 2023a.
  • Mirzaei et al. [2023b] Ashkan Mirzaei, Tristan Aumentado-Armstrong, Konstantinos G. Derpanis, Jonathan Kelly, Marcus A. Brubaker, Igor Gilitschenski, and Alex Levinshtein. SPIn-NeRF: Multiview segmentation and perceptual inpainting with neural radiance fields. In CVPR, 2023b.
  • Mirzaei et al. [2024] Ashkan Mirzaei, Riccardo De Lutio, Seung Wook Kim, David Acuna, Jonathan Kelly, Sanja Fidler, Igor Gilitschenski, and Zan Gojcic. Reffusion: Reference adapted diffusion models for 3d scene inpainting. arXiv preprint arXiv:2404.10765, 2024.
  • Miyake et al. [2023] Daiki Miyake, Akihiro Iohara, Yu Saito, and Toshiyuki Tanaka. Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models. arXiv preprint arXiv:2305.16807, 2023.
  • Mokady et al. [2023] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In CVPR, 2023.
  • Pathak et al. [2016] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
  • Podell et al. [2023] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  • Prabhu et al. [2023] Kira Prabhu, Jane Wu, Lynn Tsai, Peter Hedman, Dan B Goldman, Ben Poole, and Michael Broxton. Inpaint3d: 3d scene content generation using 2d inpainting diffusion. arXiv preprint arXiv:2312.03869, 2023.
  • Qin et al. [2024] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. In CVPR, 2024.
  • Qiu et al. [2024] Ri-Zhao Qiu, Ge Yang, Weijia Zeng, and Xiaolong Wang. Language-driven physics-based scene synthesis and editing via feature splatting. In ECCV, 2024.
  • Ravi et al. [2025] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. In ICLR, 2025.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  • Sarlin et al. [2019] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In CVPR, 2019.
  • Sarlin et al. [2020] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In CVPR, 2020.
  • Schönberger and Frahm [2016] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
  • Schönberger et al. [2016] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016.
  • Shen et al. [2024] I-Chao Shen, Hao-Kang Liu, and Bing-Yu Chen. Nerf-in: Free-form nerf inpainting with rgb-d priors. Computer Graphics and Applications (CG&A), 2024.
  • Song et al. [2021] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.
  • Su et al. [2024] Chih-Hai Su, Chih-Yao Hu, Shr-Ruei Tsai, Jie-Ying Lee, Chin-Yang Lin, and Yu-Lun Liu. Boostmvsnerfs: Boosting mvs-based nerfs to generalizable view synthesis in large-scale scenes. In ACM SIGGRAPH 2024 Conference Papers, 2024.
  • Suvorov et al. [2022] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with Fourier convolutions. In WACV, pages 2149–2159, 2022.
  • Tang et al. [2024] Luming Tang, Nataniel Ruiz, Qinghao Chu, Yuanzhen Li, Aleksander Holynski, David E Jacobs, Bharath Hariharan, Yael Pritch, Neal Wadhwa, Kfir Aberman, et al. Realfill: Reference-driven generation for authentic image completion. ACM TOG, 2024.
  • Tulsiani et al. [2017] Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • Wang et al. [2023] Can Wang, Ruixiang Jiang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. Nerf-art: Text-driven neural radiance fields stylization. IEEE TVCG, 2023.
  • Wang et al. [2024a] Dongqing Wang, Tong Zhang, Alaa Abboud, and Sabine Süsstrunk. Inpaintnerf360: Text-guided 3d inpainting on unbounded neural radiance fields. In CVPR, 2024a.
  • Wang et al. [2021] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. IBRNet: Learning multi-view image-based rendering. In CVPR, 2021.
  • Wang et al. [2024b] Yuxin Wang, Qianyi Wu, Guofeng Zhang, and Dan Xu. Gscream: Learning 3d geometry and feature consistent gaussian splatting for object removal. In ECCV, 2024b.
  • Weber et al. [2024] Ethan Weber, Aleksander Holynski, Varun Jampani, Saurabh Saxena, Noah Snavely, Abhishek Kar, and Angjoo Kanazawa. Nerfiller: Completing scenes via generative 3d inpainting. In CVPR, 2024.
  • Weder et al. [2023] Silvan Weder, Guillermo Garcia-Hernando, Aron Monszpart, Marc Pollefeys, Gabriel Brostow, Michael Firman, and Sara Vicente. Removing objects from neural radiance fields. In CVPR, 2023.
  • Wu et al. [2024] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In CVPR, 2024.
  • Yang et al. [2021] Bangbang Yang, Yinda Zhang, Yinghao Xu, Yijin Li, Han Zhou, Hujun Bao, Guofeng Zhang, and Zhaopeng Cui. Learning object-compositional neural radiance field for editable scene rendering. In ICCV, 2021.
  • Yang et al. [2024] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In CVPR, 2024.
  • Ye et al. [2024] Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. Gaussian grouping: Segment and edit anything in 3d scenes. In ECCV, 2024.
  • Yin et al. [2023] Youtan Yin, Zhoujie Fu, Fan Yang, and Guosheng Lin. Or-nerf: Object removing from 3d scenes guided by multiview segmentation with neural radiance fields. arXiv preprint arXiv:2305.10503, 2023.
  • Yu et al. [2021] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In CVPR, 2021.
  • Yu et al. [2025] Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T. Freeman, and Jiajun Wu. Wonderworld: Interactive 3d scene generation from a single image. In CVPR, 2025.
  • Yu et al. [2018] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting with contextual attention. In CVPR, 2018.
  • Yu et al. [2019] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Free-form image inpainting with gated convolution. In ICCV, 2019.
  • Zhang et al. [2020] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

Appendix A Overview

This supplementary material provides additional details and results to support the main manuscript. We first describe the training process for masked Gaussians and object removal in Section B, followed by an explanation of depth warping for bounding box generation in SAM2 [45] and its role in identifying unseen region contours in Section C. Next, we present ablations on different depth inpainting methods in Section D and a comparison of captured and inpainted references in Section E. We then outline the experimental setup in Section F and discuss the limitations of our approach in Section G. Finally, we provide additional visual comparisons in Fig. 15 for the 360-USID dataset and in Fig. 16 for the other collected 360 dataset [3].

Appendix B Training Masked GS for Object Removal

During the training of masked Gaussians, we use 2DGS [17] as our codebase and introduce a masked attribute, ranging between 0 and 1, for each Gaussian. The L1 loss is computed between the object mask obtained via SAM2 [45] and the rasterized object mask for each training view. Additionally, we incorporate the Grouping Loss proposed by Gaussian Grouping [67], ensuring that neighboring Gaussians have similar masked attributes. This ensures that our Gaussian model retains accurate object mask information and is capable of rendering precise object masks for subsequent applications.

Thanks to the explicit nature of Gaussian Splatting, we can directly remove Gaussians with a masked attribute greater than a threshold $\tau$ during the removal stage, effectively achieving object removal. In our implementation, $\tau$ is set to 0.6.
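The removal step amounts to a boolean filter over per-Gaussian arrays. The sketch below is illustrative and does not use the 2DGS codebase's data structures; the dictionary layout and the function name are assumptions, and only the threshold $\tau = 0.6$ comes from the text.

```python
import numpy as np


def remove_masked_gaussians(params, masked_attr, tau=0.6):
    """Drop every Gaussian whose masked attribute is greater than tau."""
    keep = np.asarray(masked_attr) <= tau
    kept = {name: np.asarray(values)[keep] for name, values in params.items()}
    return kept, keep
```

Each entry of `params` is assumed to be an array whose first axis indexes Gaussians (for example positions of shape N×3 and opacities of shape N), so that one mask filters every attribute consistently.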

Figure 12: Intermediate Results of Depth Warping for Unseen Region Detection. This figure illustrates the intermediate results generated during the depth warping process. (a) and (b) show the RGB image and the corresponding removal region at view $n$, respectively. (c) displays the removal regions obtained from view $i$ ($i \neq n$). (d) shows the unseen region obtained from view $i$ through backward traversal. The intersections are concentrated near the unseen region. Note that the pixels within the unseen region, but with a value of zero, are due to the absence of Gaussians in that area, preventing depth rendering and thus making it impossible to establish pixel correspondences between view $n$ and view $i$. (e) presents the aggregation of all unseen regions obtained from view $i$ at view $n$. A threshold is applied to this result, and it is then intersected with the removal region at view $n$ to obtain the final result in (f).

Appendix C Depth Warping for Unseen Contours

Following Sec. 3.2 and Fig. 4 of the main paper, we explain in detail how depth warping allows us to identify the contours of the unseen region, as illustrated in Fig. 12. Without loss of generality, consider finding the unseen region contour at view $n$. For each pair of views $n$ and $i$, we first compute the removal region for view $i$ by identifying pixels whose rendered depth differs from the incomplete depth of view $i$, rather than using object masks. This approach better captures geometric changes and prevents misalignment artifacts, leading to improved SAM2 [45] prompts and more precise unseen masks (Fig. 13).

Next, we establish pixel correspondences between view $n$ and view $i$ using the incomplete depth of view $n$. The removal region of view $i$ is then backward-traversed to view $n$ based on these correspondences. During this backward traversal, it is important to note that pixels outside the unseen region in view $i$ will correspond to the background areas in view $n$, while pixels belonging to the unseen region remain in the unseen region. By aggregating contributions from all views $i$ ($i \neq n$), we project non-unseen regions from each view $i$ into different areas of view $n$, while consolidating the unseen regions. This allows us to identify the contours of the unseen region in view $n$. These contours can then be used as the bounding box prompt for SAM2, resulting in a more accurate unseen mask.
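The procedure described in this section can be sketched with NumPy as follows. This is a simplified illustration, not the authors' code. It assumes a pinhole camera with intrinsics `K` shared by both views, a 4×4 rigid transform `T_n_to_i` from camera $n$ to camera $i$, equal image resolution across views, nearest-neighbour lookup, and that aggregation is the fraction of views voting for a pixel, thresholded at $\theta$. All function names and the depth tolerance `tol` are hypothetical.

```python
import numpy as np


def removal_region(rendered_depth, incomplete_depth, tol=1e-3):
    """Pixels whose depth changes once the object's Gaussians are removed."""
    return np.abs(rendered_depth - incomplete_depth) > tol


def warp_pixels(depth_n, K, T_n_to_i):
    """Project every pixel of view n into view i using view n's incomplete depth."""
    h, w = depth_n.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    pts_n = (np.linalg.inv(K) @ pix.T) * depth_n.reshape(1, -1)   # 3 x N in camera n
    pts_i = T_n_to_i[:3, :3] @ pts_n + T_n_to_i[:3, 3:4]          # 3 x N in camera i
    z = pts_i[2]
    valid = (depth_n.reshape(-1) > 0) & (z > 1e-8)                # no depth -> no match
    proj = K @ pts_i
    uv = np.zeros((2, h * w))
    uv[:, valid] = proj[:2, valid] / z[valid]
    return uv.T.reshape(h, w, 2), valid.reshape(h, w)


def backward_traverse(region_i, uv, valid):
    """Pull view i's removal region back into view n via the correspondences."""
    h, w = region_i.shape
    u = np.rint(uv[..., 0]).astype(int)
    v = np.rint(uv[..., 1]).astype(int)
    inside = valid & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out = np.zeros(uv.shape[:2], dtype=bool)
    out[inside] = region_i[v[inside], u[inside]]
    return out


def unseen_contour(warped_regions, removal_n, theta=0.6):
    """Aggregate over views i, threshold, and intersect with view n's removal region."""
    votes = np.mean(np.stack(warped_regions).astype(np.float64), axis=0)
    return (votes >= theta) & removal_n


def bbox_prompt(mask):
    """Tight (x_min, y_min, x_max, y_max) box around a non-empty mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

Pixels of view $n$ without rendered depth are marked invalid and contribute zero, which matches the zero-valued pixels noted in the caption of Fig. 12. The box returned by `bbox_prompt` is the kind of prompt that would then be passed to the segmentation model.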

Figure 13: Ablation Study on Removal Region Definition. Comparison of (a) object masks vs. (b) depth difference for defining removal regions. Object masks fail to capture geometric changes, leading to less accurate unseen masks. Depth difference better preserves scene structure, improving SAM2 prompts and unseen region segmentation.

Appendix D Comparison of Depth Completion Methods

In addition to Fig. 11 of the main paper, we compare scale-shift alignment, LaMa [54], InFusion [29], GDD [70], and AGDD for depth completion. As shown in Tab. 3, we evaluate the mean absolute difference (MAD) within the object mask areas over 30 test views, using pseudo-GT depth from a 2DGS trained on 200 removal images, as mentioned in Sec. 4. Scale-shift alignment leaves boundaries misaligned in 360° scenes, while LaMa provides reasonable depth completion but does not fully resolve alignment issues. AGDD achieves the lowest MAD and better handles complex geometry.

Table 3: MAD values for different depth completion methods (lower is better).

| Depth completion method | MAD ↓ |
| --- | --- |
| Scale-shift align | 0.063 |
| LaMa depth inpainting | 0.077 |
| InFusion | 0.047 |
| GDD | 0.065 |
| AGDD | 0.045 |
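For concreteness, a masked MAD of the kind reported in Tab. 3 could be computed as in the sketch below. The function names are hypothetical, and averaging per view before averaging across views is an assumption; the paper does not specify the aggregation order or any depth normalization.

```python
import numpy as np

def masked_mad(pred_depth, gt_depth, object_mask):
    """Mean absolute depth difference restricted to the object mask."""
    mask = np.asarray(object_mask, dtype=bool)
    if not mask.any():
        raise ValueError("object mask is empty")
    diff = np.abs(np.asarray(pred_depth, dtype=np.float64)
                  - np.asarray(gt_depth, dtype=np.float64))
    return float(diff[mask].mean())

def mean_mad_over_views(preds, gts, masks):
    """Average the per-view masked MAD over all test views (assumed aggregation)."""
    return float(np.mean([masked_mad(p, g, m) for p, g, m in zip(preds, gts, masks)]))

# Toy 2x2 view: only the right column lies inside the object mask.
pred = np.array([[1.0, 2.0], [3.0, 4.0]])
gt = np.array([[1.0, 2.5], [3.0, 3.0]])
mask = np.array([[False, True], [False, True]])
mad = masked_mad(pred, gt, mask)  # (0.5 + 1.0) / 2 = 0.75
```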

Appendix E Reference Images in Real-World Use

Our 360-USID dataset provides real-world captured reference images. However, this does not mean that our method requires extra input. In practical scenarios, reference images can be captured after the object has been removed. Using captured references also ensures a fair evaluation by avoiding hallucinated textures, even if the inpainting is consistent. Additionally, reference guidance helps reduce multi-view inconsistency with minimal extra input. As shown in Tab. 4, while LaMa-based references slightly degrade the results, they still outperform other reference-based methods, such as GScream. Even when using an inpainted image as a reference, our approach still achieves good results.

Table 4: Comparison of Captured and Inpainted Reference.

| Reference method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FID ↓ |
| --- | --- | --- | --- | --- |
| GScream | 14.758 | 0.955 | 0.514 | 152.295 |
| LaMa-reference | 17.102 | 0.960 | 0.407 | 69.874 |
| Captured-reference | 17.661 | 0.961 | 0.388 | 62.173 |

Appendix F Experimental Setup

F.1 LeftRefill [5]

We use the same reference image as in our method, along with the rendered object masks of each novel testing view generated by our masked Gaussians, as input to LeftRefill and directly perform reference-based inpainting on each testing novel view.

F.2 2DGS [17] + LaMa [54]

We provide the same reference image and training view object masks as in our method and use LaMa [54] to obtain per-frame inpainting results for each training view to train the 2DGS.

F.3 2DGS [17] + LeftRefill [5]

We provide the same reference image and training view object masks as in our method and use LeftRefill to obtain per-frame inpainting results for each training view to train the 2DGS.

Figure 14: Failure Cases. The figure illustrates failure cases of inpainting results. These examples highlight the challenges of 3D inpainting when significant occlusions are present near the regions requiring inpainting. For instance, (b) and (c) demonstrate difficulties in achieving satisfactory guided inpainted RGB images in the training views, while (d) and (e) show errors resulting from incorrect pixel unprojections. These observations indicate that this issue is not effectively addressed by any of the compared methods, suggesting a potential avenue for further exploration and improvement.

F.4 SPIn-NeRF [36]

The original SPIn-NeRF [36] codebase is designed for forward-facing scenes; however, we adapt it for comparison on 360° scenes by implementing its approach on 2DGS [17]. We first obtain the depth for each training view by training a 2DGS model. Next, we generate inpainted RGB and depth maps using LaMa [54], which are then used to train the inpainted 2DGS model. During training, we follow SPIn-NeRF’s methodology by incorporating patch-based RGB-LPIPS loss and using the Pearson correlation coefficient to compute a scale- and shift-invariant depth loss.
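To illustrate why a Pearson-based depth loss is invariant to scale and shift, the sketch below uses one common form, 1 − ρ, where ρ is the Pearson correlation between predicted and target depth. The exact form used in our adaptation is not spelled out above, so this form, the function name, and the `eps` stabilizer are assumptions.

```python
import numpy as np

def pearson_depth_loss(pred_depth, target_depth, eps=1e-8):
    """1 - Pearson correlation; unchanged under positive scaling and shifting."""
    p = np.asarray(pred_depth, dtype=np.float64).ravel()
    t = np.asarray(target_depth, dtype=np.float64).ravel()
    # Centering removes any constant shift.
    p = p - p.mean()
    t = t - t.mean()
    # Normalizing by the norms removes any positive scale factor.
    corr = (p @ t) / (np.linalg.norm(p) * np.linalg.norm(t) + eps)
    return float(1.0 - corr)

target = np.array([1.0, 2.0, 3.0, 5.0])
loss_affine = pearson_depth_loss(3.0 * target + 2.0, target)  # close to 0
loss_flipped = pearson_depth_loss(-target, target)            # close to 2
```

Because the loss ignores a global scale and shift, it can supervise rendered depth with inpainted depth maps whose absolute scale is unreliable.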

F.5 GScream [61]

We follow the original GScream [61] pipeline as a baseline for comparison. We provide the same reference image and training view object masks as our method to ensure consistency. Following their pipeline, we use Marigold [19] to generate estimated depths for all training images, meeting GScream’s input data requirements.

Figure 15: Visual Comparison on our 360-USID dataset.

F.6 Gaussian Grouping [67]

We utilize the original Gaussian Grouping [67] codebase as a baseline for comparison. First, it generates segmentation IDs, from which we select the IDs corresponding to objects that require inpainting. These selected IDs are then used in the removal process. Following the original workflow, the unseen regions are identified, subsequently inpainted, and used for their fine-tuning process.

Notably, after removing objects from the scene, Gaussian Grouping relies on TrackingAnything-DEVA [8] to identify unseen regions requiring further inpainting through the "black blurry hole" prompt. However, DEVA occasionally fails to accurately identify unseen regions in certain scenes, leading to incorrect inpainting and suboptimal results. Additionally, in some scenes, such as the bonsai scene from the Mip-NeRF-360 [3] dataset and the plant scene from the 360-USID dataset, the object tracker misidentifies objects, resulting in incorrect object removal and further degrading the inpainting quality.

F.7 InFusion [29]

We use the original InFusion [29] codebase as a baseline for comparison. We provide the same reference image used in our method as the input RGB for its depth completion model. This reference image is also used in its fine-tuning process.

Appendix G Limitations

Our method successfully addresses complex, unbounded 360° scene inpainting. However, rendering the unprojected initial Gaussians and applying SDEdit [32] to enhance the guided inpainted RGB images can be time-consuming, particularly for high-resolution or large-scale scenes, which poses challenges for real-time applications. Furthermore, our analysis in Fig. 14 shows that the method may produce incorrect pixel unprojections in cases with significant occlusions near the object requiring inpainting, resulting in floaters in the final inpainted outputs. This limitation is similarly observed across all compared methods, underscoring a valuable direction for future research and improvement.

Figure 16: Visual Comparison on Other-360 dataset.