LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows




Meta Reality Labs, Research  

Overview

We introduce the Large Sparse Reconstruction Model (LSRM) to study how scaling transformer context windows impacts feed-forward 3D reconstruction. Although recent object-centric feed-forward methods deliver robust, high-quality reconstruction, they still lag behind dense-view optimization in recovering fine-grained texture and appearance. We show that expanding the context window, by substantially increasing the number of active object and image tokens, substantially narrows this gap and enables high-fidelity 3D object reconstruction and inverse rendering. To scale effectively, we adapt native sparse attention in our architecture design, unlocking its capacity for 3D reconstruction with three key contributions: (1) an efficient coarse-to-fine pipeline that focuses computation on informative regions by predicting sparse high-resolution residuals; (2) a 3D-aware spatial routing mechanism that establishes accurate 2D-3D correspondences using explicit geometric distances rather than standard attention scores; and (3) a custom block-aware sequence parallelism strategy that uses an All-gather-KV protocol to balance dynamic, sparse workloads across GPUs. As a result, LSRM handles 20× more object tokens and >  more image tokens than prior state-of-the-art (SOTA) methods. Extensive evaluations on standard novel-view synthesis benchmarks show substantial gains over the current SOTA: > 2.4 dB higher PSNR and > 40% lower LPIPS. When LSRM is extended to inverse rendering, qualitative and quantitative evaluations on widely used benchmarks show consistent improvements in texture and geometry detail, with an LPIPS that matches or exceeds that of SOTA dense-view optimization methods.
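To make contribution (2) concrete, here is a minimal sketch of routing by geometric distance rather than by attention score. It is an illustration under stated assumptions, not the LSRM implementation: the pinhole camera sitting at the origin, the non-overlapping square image patches, and the names `project`, `patch_centers`, and `route_by_distance` are all hypothetical. Each 3D point is projected into the image, and the `k` image tokens whose patch centers lie closest to that projection are selected for it.

```python
import math

def project(point, focal, cx, cy):
    """Pinhole projection of a camera-space point (x, y, z) to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal * x / z + cx, focal * y / z + cy)

def patch_centers(width, height, patch):
    """Pixel-space centers of non-overlapping patch-by-patch image tokens, row-major."""
    return [
        (col * patch + patch / 2, row * patch + patch / 2)
        for row in range(height // patch)
        for col in range(width // patch)
    ]

def route_by_distance(points, centers, focal, cx, cy, k):
    """For each 3D point, return the indices of the k image tokens nearest its projection."""
    routes = []
    for point in points:
        u, v = project(point, focal, cx, cy)
        # Rank image tokens by 2D Euclidean distance to the projected point.
        # This is an assumed stand-in for the geometric distance used for routing.
        order = sorted(
            range(len(centers)),
            key=lambda i: math.hypot(centers[i][0] - u, centers[i][1] - v),
        )
        routes.append(order[:k])
    return routes

# Toy setup: a 64x64 image split into a 4x4 grid of 16-pixel patches.
points = [(0.0, 0.0, 1.0), (-0.75, -0.75, 1.0)]
centers = patch_centers(width=64, height=64, patch=16)
routes = route_by_distance(points, centers, focal=32.0, cx=32.0, cy=32.0, k=4)
print(routes)
```

In this toy setup the first point projects to the image center, so it is routed to the four central patches, while the second projects onto the center of the top-left patch. A multi-view version would repeat the projection per input camera using that camera's pose.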

The network architecture of LSRM is shown above. Our method employs a two-stage coarse-to-fine pipeline. In Stage 1, a Dense Reconstruction Transformer generates a coarse, low-resolution volume. Stage 2 uses this coarse volume to initialize the active sparse volume tokens and predicts high-resolution sparse residuals, which together form the final 3D representation.
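The two-stage idea can be sketched in a few lines. This is a toy illustration under assumptions, not the authors' code: the coarse volume is stored as a dictionary of scalar values, active voxels are chosen by a simple threshold on the coarse value (the selection rule is not specified in this text), and `predict_residual` is a hypothetical stand-in for the Stage 2 network. Only the children of active coarse voxels are refined, which is what keeps the high-resolution computation sparse.

```python
from itertools import product

def active_fine_voxels(coarse, threshold, factor):
    """Fine-grid coordinates of the children of coarse voxels whose value exceeds threshold."""
    active = []
    for (i, j, k), value in sorted(coarse.items()):
        if value > threshold:  # assumed activity criterion
            for di, dj, dk in product(range(factor), repeat=3):
                active.append((i * factor + di, j * factor + dj, k * factor + dk))
    return active

def refine(coarse, threshold, factor, predict_residual):
    """Sparse fine volume: upsampled coarse value plus a residual, at active voxels only."""
    fine = {}
    for coord in active_fine_voxels(coarse, threshold, factor):
        parent = tuple(c // factor for c in coord)  # nearest-neighbor upsampling
        fine[coord] = coarse[parent] + predict_residual(coord)
    return fine

def toy_residual(coord):
    """Hypothetical residual predictor; a real model would use image features here."""
    return 0.01 * sum(coord)

# Stage 1 output (toy): four coarse voxels, only one of which is active.
coarse = {(0, 0, 0): 0.9, (0, 0, 1): 0.1, (0, 1, 0): 0.2, (1, 0, 0): 0.05}
fine = refine(coarse, threshold=0.5, factor=2, predict_residual=toy_residual)
print(len(fine), "active fine voxels out of", len(coarse) * 2 ** 3)
```

Here only 8 of the 32 candidate fine voxels are evaluated, because three of the four coarse voxels fall below the threshold and are never subdivided.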

Novel-View Synthesis

Comparison: LIRM (left) vs. LSRM (right).


Inverse Rendering & Relighting

Comparison: each baseline method (left) vs. LSRM (right).


Relighting Videos

Comparison: LIRM (left) vs. LSRM (right).
