NeRF-UAVeL: Unified Attention-driven Volumetric Learning for Enhanced NeRF-based 3D Object Detection

Hana L. Goshu1, Tadesse G. Wakjira2, Meklit M. Atlaw3, Kin-Chung Chan1, Songjiang Lai1, Kin-Man Lam1
1The Hong Kong Polytechnic University 2Kennesaw State University 3Xi'an Jiaotong University
Neurocomputing (Under Review)
Paper (coming soon) Code Data

NeRF-UAVeL produces accurate 3D bounding box predictions from NeRF-derived volumetric representations across synthetic and real-world indoor scenes.

Abstract

Three-dimensional (3D) object detection based on Neural Radiance Fields (NeRF) has emerged as a promising direction for reconstructing complex environments from posed RGB images. However, existing NeRF-based detectors often suffer from coarse feature encoding and limited attention to multi-scale volumetric structure, leading to inaccurate localization and poor generalization in real-world scenarios. To address these challenges, we propose NeRF-UAVeL, a unified attention-driven volumetric learning detection framework that integrates four novel modules into a NeRF-derived 3D volumetric backbone: Multi-dimensional Volumetric Attention Pooling (MVAP), Tri-Scale Asymmetric Convolutional Aggregation (TACA), Dual-Domain Attention Fusion (DDAF), and Volumetric Cross-Window Attention Fusion (V-CWAF). MVAP enhances spatial selectivity through adaptive attention-based pooling, TACA captures multi-scale volumetric features through asymmetric convolutional branches, DDAF applies lightweight channel and spatial recalibration for refined feature emphasis, and V-CWAF injects cross-window attention with dual-stage channel recalibration to boost high-level semantic encoding. Extensive experiments on the 3D-FRONT and ScanNet datasets demonstrate that NeRF-UAVeL outperforms both point cloud-based and multi-view-based methods. Specifically, it improves AP50 by +6.7% and R50 by +7.3% over the baseline on 3D-FRONT, and achieves a +6.9% improvement in AP50 and a +2.9% improvement in R50 on the ScanNet dataset. These results confirm the effectiveness of our attention-calibrated, multi-scale volumetric architecture in producing precise and robust 3D bounding box predictions across both synthetic and real-world scenes.

Method

NeRF-UAVeL augments a NeRF-derived volumetric backbone with four attention-driven modules at different stages of the feature hierarchy.

MVAP (Multi-dimensional Volumetric Attention Pooling) — Enhances spatial selectivity through adaptive attention-based pooling across volumetric dimensions.
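The exact MVAP formulation is given in the paper; as a rough, hypothetical sketch of the general idea of attention-based pooling over a feature volume (the function names and the single-logit projection `w` are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool_3d(feat, w):
    """Attention-weighted pooling of a (C, D, H, W) feature volume.

    Unlike plain average pooling, each voxel contributes in proportion
    to a learned score (`w` projects channels to one logit per voxel),
    so informative regions dominate the pooled descriptor.
    """
    C = feat.shape[0]
    flat = feat.reshape(C, -1)          # (C, D*H*W)
    scores = w @ flat                   # (1, D*H*W) voxel logits
    attn = softmax(scores, axis=-1)     # attention over all voxels
    return (flat * attn).sum(axis=-1)   # (C,) pooled descriptor
```

With `w = 0` the attention is uniform and the result reduces to ordinary average pooling, which makes the behaviour easy to sanity-check.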

TACA (Tri-Scale Asymmetric Convolutional Aggregation) — Captures multi-scale volumetric features via asymmetric convolutional branches at three complementary scales.

DDAF (Dual-Domain Attention Fusion) — Applies lightweight channel and spatial recalibration for refined feature emphasis.
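A minimal sketch of dual-domain (channel then spatial) recalibration in the squeeze-and-excitation style; the tiny MLP weights `w1`, `w2` and the channel-mean spatial map are assumptions for illustration, not the paper's DDAF design:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_domain_recalibrate(feat, w1, w2):
    """Channel then spatial recalibration of a (C, D, H, W) volume.

    Channel branch: global average pool -> tiny MLP (w1, w2) -> per-channel
    gates in (0, 1). Spatial branch: channel-mean map -> sigmoid ->
    per-voxel gates. Both branches only rescale, never add, features.
    """
    C = feat.shape[0]
    # channel attention (squeeze-and-excitation style)
    squeeze = feat.reshape(C, -1).mean(axis=-1)        # (C,)
    gates = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0))  # (C,)
    out = feat * gates[:, None, None, None]
    # spatial attention from the channel-mean map
    smap = sigmoid(out.mean(axis=0, keepdims=True))    # (1, D, H, W)
    return out * smap
```

Because both gates lie in (0, 1), the module can only attenuate voxels, which is why this style of recalibration is cheap and stable to train.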

V-CWAF (Volumetric Cross-Window Attention Fusion) — Injects cross-window attention with dual-stage channel recalibration for high-level semantic encoding.
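The sketch below shows only plain within-window self-attention on a 3D volume, with identity Q/K/V projections to stay short; the paper's cross-window information exchange and dual-stage channel recalibration are not reproduced here and the partitioning scheme is an assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_self_attention(feat, win=2):
    """Softmax self-attention among voxels of non-overlapping win^3
    windows of a (C, D, H, W) volume (D, H, W divisible by `win`)."""
    C, D, H, W = feat.shape
    # partition into windows: (n_windows, win**3 tokens, C channels)
    x = feat.reshape(C, D // win, win, H // win, win, W // win, win)
    x = x.transpose(1, 3, 5, 2, 4, 6, 0).reshape(-1, win ** 3, C)
    # scaled dot-product attention within each window
    attn = softmax(x @ x.transpose(0, 2, 1) / np.sqrt(C))  # (nwin, t, t)
    y = attn @ x                                           # (nwin, t, C)
    # reverse the window partition
    y = y.reshape(D // win, H // win, W // win, win, win, win, C)
    return y.transpose(6, 0, 3, 1, 4, 2, 5).reshape(C, D, H, W)
```

Each output voxel is a convex combination of voxels in its own window, so window attention keeps the quadratic attention cost bounded by the window size rather than the full volume.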

Main Results

Quantitative results on the 3D-FRONT and ScanNet datasets. The first block includes point-cloud-based methods, while the remaining entries are multi-view-based detection methods.

Method      | 3D-FRONT                  | ScanNet
            | R25   R50   AP25  AP50    | R25   R50   AP25  AP50
VoteNet     | 81.5  61.6  73.0  49.6    | 78.5  34.2  66.8  18.2
GroupFree   | 84.9  63.7  72.1  45.1    | 75.2  37.6  60.1  20.4
FCAF3D      | 89.1  56.9  73.1  35.2    | 90.2  42.4  63.7  18.5
ImVoxelNet  | 88.3  71.5  86.1  66.4    | 51.7  20.2  37.3   9.8
NeRF-MAE*   | 97.2  74.5  85.3  63.0    | 92.0  39.5  57.1  17.0
NeRF-RPN    | 96.3  69.9  85.2  59.9    | 89.2  42.9  55.5  18.4
Ours        | 98.5  77.2  87.3  66.6    | 92.6  45.8  60.2  25.3

* NeRF-MAE results include pretraining on 3D-FRONT and two additional external datasets (Hypersim and HM3D).

Qualitative Visualization

3D-FRONT

Rotating 3D visualization of predicted bounding boxes on the 3D-FRONT dataset. Colored wireframe boxes indicate detected objects overlaid on the NeRF-derived volumetric scene representation.

ScanNet

Rotating 3D visualization of predicted bounding boxes on the ScanNet dataset. Our method accurately localizes objects in cluttered real-world indoor scenes.

BibTeX

@article{nerf_uavel2026,
    title     = {NeRF-UAVeL: Unified Attention-driven Volumetric Learning for Enhanced NeRF-based 3D Object Detection},
    author    = {Goshu, Hana L. and Wakjira, Tadesse G. and Atlaw, Meklit M. and Chan, Kin-Chung and Lai, Songjiang and Lam, Kin-Man},
    journal   = {Neurocomputing},
    year      = {2026},
    note      = {Under Review}
}

Acknowledgements

This work builds upon NeRF-RPN. We thank the authors for making their code and data publicly available.