We present the Large Inverse Rendering Model (LIRM), a transformer architecture that jointly reconstructs high-quality shape, materials, and radiance fields with view-dependent effects in less than a second. Our model builds upon the recent Large Reconstruction Models (LRMs) that achieve state-of-the-art sparse-view reconstruction quality. However, existing LRMs struggle to reconstruct unseen parts accurately and cannot recover glossy appearance or generate relightable 3D content that can be consumed by standard graphics engines. To address these limitations, we make three key technical contributions to build a more practical multi-view 3D reconstruction framework. First, we introduce an update model that allows us to progressively add more input views to improve our reconstruction. Second, we propose a hexa-plane neural SDF representation to better recover detailed textures, geometry, and material parameters. Third, we develop a novel neural directional-embedding mechanism to handle view-dependent effects. Trained on a large-scale shape and material dataset with a tailored coarse-to-fine training scheme, our model achieves compelling results. It compares favorably to optimization-based dense-view inverse rendering methods in terms of geometry and relighting accuracy, while requiring only a fraction of the inference time.
The network architecture of LIRM is shown above. The inputs are masked images, background images that provide additional lighting information, and Plücker rays that encode the camera intrinsics and extrinsics. These three inputs are concatenated and turned into tokens through a simple linear layer. The tokens are sent to a transformer consisting of 24 self-attention blocks, which updates the hexa-plane tokens and NDE tokens. We decode the two kinds of tokens into the hexa-plane representation and NDE panoramas through linear layers, which are used to render view-dependent radiance fields and BRDF parameters via neural volume rendering. The decoded SDF volume can be used to extract an accurate triangular mesh through standard marching cubes.
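To make the tokenization concrete, below is a minimal PyTorch-style sketch of how per-pixel Plücker rays could be computed and concatenated with the masked and background images before the linear patch embedding. The patch size, embedding width, and module names are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

def plucker_rays(origins, directions):
    # Per-pixel Plücker coordinates (d, o x d); origins/directions: [B, H, W, 3].
    moments = torch.cross(origins, directions, dim=-1)
    return torch.cat([directions, moments], dim=-1)  # [B, H, W, 6]

class InputTokenizer(nn.Module):
    # Concatenates masked RGB, background RGB, and Plücker rays per pixel,
    # then patchifies and projects to tokens with a single linear layer.
    def __init__(self, patch=16, dim=1024):
        super().__init__()
        self.patch = patch
        # 3 (masked RGB) + 3 (background RGB) + 6 (Plücker) = 12 channels per pixel
        self.proj = nn.Linear(12 * patch * patch, dim)

    def forward(self, masked_rgb, background_rgb, plucker):
        x = torch.cat([masked_rgb, background_rgb, plucker], dim=-1)  # [B, H, W, 12]
        B, H, W, C = x.shape
        p = self.patch
        # Split the image into non-overlapping patches and flatten each one.
        x = x.view(B, H // p, p, W // p, p, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B, (H // p) * (W // p), p * p * C)
        return self.proj(x)  # [B, num_tokens, dim]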
LIRM can progressively improve reconstruction quality with more input images, enabling interactive reconstruction. The example above shows a case where the first set of 4 input images covers only the front side of the object and the second set of 4 images covers only the back side. With the first set of inputs, our LIRM only reconstructs the front side of the object accurately. After taking the second set of inputs, our network updates the hexa-plane prediction to obtain a high-quality reconstruction of the full 3D object.
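As a rough illustration of how this progressive update could be driven at inference time, the sketch below feeds view sets one batch at a time and reuses the previously predicted scene tokens as the starting state for the next update; all method names here are hypothetical.

import torch

@torch.no_grad()
def progressive_reconstruct(model, view_batches):
    # Reuse the latent scene tokens predicted from earlier view sets as the
    # starting state for the next update (all method names are hypothetical).
    scene_tokens = model.init_tokens()                            # initial hexa-plane + NDE tokens
    for masked_rgb, background_rgb, plucker in view_batches:
        image_tokens = model.tokenize(masked_rgb, background_rgb, plucker)
        scene_tokens = model.update(scene_tokens, image_tokens)   # transformer refinement
    return model.decode(scene_tokens)                             # hexa-plane features + NDE panoramas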
Our update model generalizes to challenging scene-change scenarios without any fine-tuning. In the example above, we first use LIRM to reconstruct the teddy bear from 4 input images. We then add a small figurine behind the teddy bear and take another 4 images as the second set of inputs. Our model successfully uses the new input images to reconstruct the added figurine while keeping the front side of the teddy bear unchanged, even though the front side is not observed in the new inputs.
While tri-plane-based 3D representations can achieve highly detailed reconstruction in most scenarios, we observe that they struggle when the two sides of an object contain complex but different textures, as shown above. This limitation arises from using a single feature plane to represent the textures on both sides. We therefore adopt a hexa-plane representation, using 6 planes to divide the bounding box into 8 sub-volumes, each with its own tri-plane. This simple modification prevents texture patterns from "leaking" to the other side, as shown in the above examples from the Google Scanned Objects dataset. Note that we reduce the resolution of the hexa-plane so that the total computational cost remains roughly the same.
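One plausible reading of the hexa-plane lookup is sketched below: two feature planes per orientation (XY, XZ, YZ), one for each half of the bounding box along the orthogonal axis, so that every sub-volume is covered by its own tri-plane. The tensor shapes and the summation used to fuse the three sampled features are assumptions for illustration, not the paper's exact design.

import torch
import torch.nn.functional as F

def sample_hexaplane(planes, pts):
    # planes: dict with keys 'xy', 'xz', 'yz', each a [2, C, R, R] tensor
    # (one feature plane per half of the box); pts: [N, 3] points in [-1, 1].
    N = pts.shape[0]
    feats = 0.0
    axes = {'xy': (0, 1, 2), 'xz': (0, 2, 1), 'yz': (1, 2, 0)}
    for name, (u, v, w) in axes.items():
        side = (pts[:, w] > 0).long()                                   # which half of the box along axis w
        uv = pts[:, [u, v]].view(1, 1, N, 2).expand(2, 1, N, 2)
        sampled = F.grid_sample(planes[name], uv, align_corners=True)   # [2, C, 1, N]
        sampled = sampled[:, :, 0].permute(0, 2, 1)                     # [2, N, C]
        feats = feats + sampled[side, torch.arange(N)]                  # pick the per-point half
    return feats                                                        # [N, C] fused feature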
Existing feed-forward sparse-view reconstruction methods neglect view-dependent effects. We tackle this problem by incorporating the neural directional encoding (NDE) of Wu et al. into LIRM. Experiments on both synthetic and real data show that LIRM better recovers the specular highlights visible in the input images, as shown in the example above.
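As a hedged sketch of how a decoded NDE panorama might be queried at render time, the snippet below reflects the view direction about the surface normal and samples a panoramic feature map using an equirectangular parameterization; the parameterization, shapes, and function names are illustrative assumptions rather than the exact mechanism of Wu et al. or LIRM.

import torch
import torch.nn.functional as F

def reflect(view_dirs, normals):
    # Mirror the (unit) view direction about the (unit) surface normal.
    return view_dirs - 2.0 * (view_dirs * normals).sum(-1, keepdim=True) * normals

def query_nde_panorama(panorama, dirs):
    # panorama: [1, C, H, W] decoded NDE feature map; dirs: [N, 3] unit vectors.
    theta = torch.atan2(dirs[:, 1], dirs[:, 0]) / torch.pi      # azimuth mapped to [-1, 1]
    phi = torch.asin(dirs[:, 2].clamp(-1, 1)) / (torch.pi / 2)  # elevation mapped to [-1, 1]
    grid = torch.stack([theta, phi], dim=-1).view(1, 1, -1, 2)
    feats = F.grid_sample(panorama, grid, align_corners=True)   # [1, C, 1, N]
    return feats[0, :, 0].permute(1, 0)                         # [N, C], fed to the radiance head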
By combining the above technical components, LIRM achieves high-quality sparse-view inverse rendering on real-world data. Above we show qualitative and quantitative results on the widely used Stanford-ORB dataset. Our model achieves reconstruction quality on par with, or even better than, SOTA optimization-based methods that take dense views as input and require several hours to run. It also achieves the highest geometry reconstruction quality compared to optimization-based methods. We observe that LIRM handles specular materials better, producing specular highlights on the surface that closely match those of the ground-truth images, while optimization-based methods either miss the specular highlights or fail to match the ground truth accurately. Since we encode background images as inputs, LIRM can also better separate the lighting color from the material color, as shown in the second row. We also compare to MetaLRM, a concurrent LRM-based inverse rendering method; LIRM outperforms it by a large margin both qualitatively and quantitatively.
@inproceedings{li2025lirm,
  title     = {LIRM: Large Inverse Rendering Model for Progressive Reconstruction of Shape, Materials and View-dependent Radiance Fields},
  author    = {Zhengqin Li and Dilin Wang and Ka Chen and Zhaoyang Lv and Thu Nguyen-Phuoc and Milim Lee and Jia-Bin Huang and Lei Xiao and Cheng Zhang and Yufeng Zhu and Carl S. Marshall and Yufeng Ren and Richard Newcombe and Zhao Dong},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
}