NeRF-UAVeL produces accurate 3D bounding box predictions from NeRF-derived volumetric representations across synthetic and real-world indoor scenes.
Three-dimensional (3D) object detection based on Neural Radiance Fields (NeRF) has emerged as a promising direction for reconstructing complex environments from posed RGB images. However, existing NeRF-based detectors often suffer from coarse feature encoding and limited attention to multi-scale volumetric structure, leading to inaccurate localization and poor generalization in real-world scenarios. To address these challenges, we propose NeRF-UAVeL, a unified attention-driven volumetric learning detection framework that integrates four novel modules into a NeRF-derived 3D volumetric backbone, namely, Multi-dimensional Volumetric Attention Pooling (MVAP), Tri-Scale Asymmetric Convolutional Aggregation (TACA), Dual-Domain Attention Fusion (DDAF), and Volumetric Cross-Window Attention Fusion (V-CWAF). MVAP enhances spatial selectivity through adaptive attention-based pooling, TACA captures multi-scale volumetric features through asymmetric convolutional branches, DDAF applies lightweight channel and spatial recalibration for refined feature emphasis, and V-CWAF injects cross-window attention with dual-stage channel recalibration to boost high-level semantic encoding. Extensive experiments on the 3D-FRONT and ScanNet datasets demonstrate that NeRF-UAVeL outperforms both point-cloud-based and multi-view-based methods. Specifically, it improves AP50 by +6.7% and R50 by +7.3% over the baseline on 3D-FRONT, and achieves a +6.9% improvement in AP50 and +2.9% in R50 on ScanNet. These results confirm the effectiveness of our attention-calibrated, multi-scale volumetric architecture in producing precise and robust 3D bounding box predictions across both synthetic and real-world scenes.
NeRF-UAVeL augments a NeRF-derived volumetric backbone with four attention-driven modules at different stages of the feature hierarchy.
MVAP (Multi-dimensional Volumetric Attention Pooling) — Enhances spatial selectivity through adaptive attention-based pooling across volumetric dimensions.
TACA (Tri-Scale Asymmetric Convolutional Aggregation) — Captures multi-scale volumetric features via asymmetric convolutional branches at three complementary scales.
DDAF (Dual-Domain Attention Fusion) — Applies lightweight channel and spatial recalibration for refined feature emphasis.
V-CWAF (Volumetric Cross-Window Attention Fusion) — Injects cross-window attention with dual-stage channel recalibration for high-level semantic encoding.
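To make the first of these ideas concrete, the following is a minimal NumPy sketch of attention-weighted pooling over a feature volume, in the general spirit of MVAP. It is an illustration only: the projection vector `w`, the single-head scoring, and the softmax over all spatial positions are our simplifying assumptions, not the paper's actual MVAP implementation.

```python
import numpy as np

def attention_pool(volume, w):
    """Attention-weighted pooling over a C x D x H x W feature volume.

    Illustrative sketch: scores each voxel with a projection vector `w`
    (shape C), softmaxes the scores over all spatial positions, and
    returns the attention-weighted average feature (shape C). A learned
    module would produce `w` (or per-voxel scores) from the features.
    """
    C = volume.shape[0]
    flat = volume.reshape(C, -1)                 # C x (D*H*W)
    scores = w @ flat                            # one score per voxel
    scores -= scores.max()                       # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum() # weights sum to 1
    return flat @ attn                           # pooled feature, shape (C,)

rng = np.random.default_rng(0)
vol = rng.standard_normal((8, 4, 4, 4))
w = rng.standard_normal(8)
pooled = attention_pool(vol, w)
print(pooled.shape)  # (8,)
```

With a constant-valued volume the softmax weights are uniform, so the pooled feature reduces to the ordinary spatial mean, which is a useful sanity check.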
Quantitative results on the 3D-FRONT and ScanNet datasets. The first three rows are point-cloud-based methods; the remaining rows are multi-view-based detection methods.
| Method | 3D-FRONT R25 | 3D-FRONT R50 | 3D-FRONT AP25 | 3D-FRONT AP50 | ScanNet R25 | ScanNet R50 | ScanNet AP25 | ScanNet AP50 |
|---|---|---|---|---|---|---|---|---|
| VoteNet | 81.5 | 61.6 | 73.0 | 49.6 | 78.5 | 34.2 | 66.8 | 18.2 |
| GroupFree | 84.9 | 63.7 | 72.1 | 45.1 | 75.2 | 37.6 | 60.1 | 20.4 |
| FCAF3D | 89.1 | 56.9 | 73.1 | 35.2 | 90.2 | 42.4 | 63.7 | 18.5 |
| ImVoxelNet | 88.3 | 71.5 | 86.1 | 66.4 | 51.7 | 20.2 | 37.3 | 9.8 |
| NeRF-MAE* | 97.2 | 74.5 | 85.3 | 63.0 | 92.0 | 39.5 | 57.1 | 17.0 |
| NeRF-RPN | 96.3 | 69.9 | 85.2 | 59.9 | 89.2 | 42.9 | 55.5 | 18.4 |
| Ours | 98.5 | 77.2 | 87.3 | 66.6 | 92.6 | 45.8 | 60.2 | 25.3 |
* NeRF-MAE results include pretraining on 3D-FRONT and two additional external datasets (Hypersim and HM3D).
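The AP25/AP50 and R25/R50 columns above count a prediction as correct when its 3D IoU with a ground-truth box exceeds 0.25 or 0.50. For axis-aligned boxes (as commonly used in NeRF-RPN-style evaluation), the IoU computation can be sketched as:

```python
import numpy as np

def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as
    (xmin, ymin, zmin, xmax, ymax, zmax) arrays."""
    lo = np.maximum(a[:3], b[:3])                  # intersection min corner
    hi = np.minimum(a[3:], b[3:])                  # intersection max corner
    inter = np.prod(np.clip(hi - lo, 0.0, None))   # zero if boxes are disjoint
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

a = np.array([0, 0, 0, 2, 2, 2], dtype=float)
b = np.array([1, 1, 1, 3, 3, 3], dtype=float)
print(round(iou_3d(a, b), 4))  # intersection 1, union 15 -> 0.0667
```

A match at IoU >= 0.5 is therefore a much stricter criterion than IoU >= 0.25, which is why the AP50 gaps in the table are the most telling.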
Rotating 3D visualization of predicted bounding boxes on the 3D-FRONT dataset. Colored wireframe boxes indicate detected objects overlaid on the NeRF-derived volumetric scene representation.
Rotating 3D visualization of predicted bounding boxes on the ScanNet dataset. Our method accurately localizes objects in cluttered real-world indoor scenes.
@article{nerf_uavel2026,
title = {NeRF-UAVeL: Unified Attention-driven Volumetric Learning for Enhanced NeRF-based 3D Object Detection},
author = {Goshu, Hana L. and Wakjira, Tadesse G. and Atlaw, Meklit M. and Chan, Kin-Chung and Lai, Songjiang and Lam, Kin-Man},
journal = {Neurocomputing},
year = {2026},
note = {Under Review}
}
This work builds upon NeRF-RPN. We thank the authors for making their code and data publicly available.