NeRF-UAVeL: Unified Attention-driven Volumetric Learning for Enhanced NeRF-based 3D Object Detection

Hana L. Goshu1, Tadesse G. Wakjira2, Meklit M. Atlaw3, Kin-Chung Chan1, Songjiang Lai1, Kin-Man Lam1
1The Hong Kong Polytechnic University 2Kennesaw State University 3Xi'an Jiaotong University
Neurocomputing (Under Review)
Paper (coming soon) Code Data

NeRF-UAVeL produces accurate 3D bounding box predictions from NeRF-derived volumetric representations across synthetic and real-world indoor scenes.

Abstract

Three-dimensional (3D) object detection based on Neural Radiance Fields (NeRF) has emerged as a promising direction for reconstructing complex environments from posed RGB images. However, existing NeRF-based detectors often suffer from coarse feature encoding and limited attention to multi-scale volumetric structure, leading to inaccurate localization and poor generalization in real-world scenarios. To address these challenges, we propose NeRF-UAVeL, a unified attention-driven volumetric learning detection framework that integrates four novel modules into a NeRF-derived 3D volumetric backbone: Multi-dimensional Volumetric Attention Pooling (MVAP), Tri-Scale Asymmetric Convolutional Aggregation (TACA), Dual-Domain Attention Fusion (DDAF), and Volumetric Cross-Window Attention Fusion (V-CWAF). MVAP enhances spatial selectivity through adaptive attention-based pooling, TACA captures multi-scale volumetric features through asymmetric convolutional branches, DDAF applies lightweight channel and spatial recalibration for refined feature emphasis, and V-CWAF injects cross-window attention with dual-stage channel recalibration to boost high-level semantic encoding. Extensive experiments on the 3D-FRONT and ScanNet datasets demonstrate that NeRF-UAVeL outperforms both point cloud-based and multi-view-based methods. Specifically, it improves AP50 by +6.7% and R50 by +7.3% over the baseline on 3D-FRONT, and achieves a +6.9% improvement in AP50 and a +2.9% improvement in R50 on the ScanNet dataset. These results confirm the effectiveness of our attention-calibrated, multi-scale volumetric architecture in producing precise and robust 3D bounding box predictions across both synthetic and real-world scenes.

Method

NeRF-UAVeL augments a NeRF-derived volumetric backbone with four attention-driven modules at different stages of the feature hierarchy.

MVAP (Multi-dimensional Volumetric Attention Pooling) — Enhances spatial selectivity through adaptive attention-based pooling across volumetric dimensions.
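The exact MVAP formulation is given in the paper; as a rough, hypothetical sketch of the general idea of attention-based pooling over a feature volume (the function names and the single-logit projection `w` are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool_3d(feat, w):
    """Attention-weighted pooling of a (C, D, H, W) feature volume.

    Unlike plain average pooling, each voxel contributes in proportion
    to a learned score (`w` projects channels to one logit per voxel),
    so informative regions dominate the pooled descriptor.
    """
    C = feat.shape[0]
    flat = feat.reshape(C, -1)          # (C, D*H*W)
    scores = w @ flat                   # (1, D*H*W) voxel logits
    attn = softmax(scores, axis=-1)     # attention over all voxels
    return (flat * attn).sum(axis=-1)   # (C,) pooled descriptor
```

With `w = 0` the attention is uniform and the result reduces to ordinary average pooling, which makes the behaviour easy to sanity-check.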

TACA (Tri-Scale Asymmetric Convolutional Aggregation) — Captures multi-scale volumetric features via asymmetric convolutional branches at three complementary scales.

DDAF (Dual-Domain Attention Fusion) — Applies lightweight channel and spatial recalibration for refined feature emphasis.
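A minimal sketch of dual-domain (channel then spatial) recalibration in the squeeze-and-excitation style; the tiny MLP weights `w1`, `w2` and the channel-mean spatial map are assumptions for illustration, not the paper's DDAF design:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_domain_recalibrate(feat, w1, w2):
    """Channel then spatial recalibration of a (C, D, H, W) volume.

    Channel branch: global average pool -> tiny MLP (w1, w2) -> per-channel
    gates in (0, 1). Spatial branch: channel-mean map -> sigmoid ->
    per-voxel gates. Both branches only rescale, never add, features.
    """
    C = feat.shape[0]
    # channel attention (squeeze-and-excitation style)
    squeeze = feat.reshape(C, -1).mean(axis=-1)        # (C,)
    gates = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0))  # (C,)
    out = feat * gates[:, None, None, None]
    # spatial attention from the channel-mean map
    smap = sigmoid(out.mean(axis=0, keepdims=True))    # (1, D, H, W)
    return out * smap
```

Because both gates lie in (0, 1), the module can only attenuate voxels, which is why this style of recalibration is cheap and stable to train.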

V-CWAF (Volumetric Cross-Window Attention Fusion) — Injects cross-window attention with dual-stage channel recalibration for high-level semantic encoding.
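The sketch below shows only plain within-window self-attention on a 3D volume, with identity Q/K/V projections to stay short; the paper's cross-window information exchange and dual-stage channel recalibration are not reproduced here and the partitioning scheme is an assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_self_attention(feat, win=2):
    """Softmax self-attention among voxels of non-overlapping win^3
    windows of a (C, D, H, W) volume (D, H, W divisible by `win`)."""
    C, D, H, W = feat.shape
    # partition into windows: (n_windows, win**3 tokens, C channels)
    x = feat.reshape(C, D // win, win, H // win, win, W // win, win)
    x = x.transpose(1, 3, 5, 2, 4, 6, 0).reshape(-1, win ** 3, C)
    # scaled dot-product attention within each window
    attn = softmax(x @ x.transpose(0, 2, 1) / np.sqrt(C))  # (nwin, t, t)
    y = attn @ x                                           # (nwin, t, C)
    # reverse the window partition
    y = y.reshape(D // win, H // win, W // win, win, win, win, C)
    return y.transpose(6, 0, 3, 1, 4, 2, 5).reshape(C, D, H, W)
```

Each output voxel is a convex combination of voxels in its own window, so window attention keeps the quadratic attention cost bounded by the window size rather than the full volume.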

Main Results

Quantitative results on the 3D-FRONT and ScanNet datasets. The first block includes point-cloud-based methods, while the remaining entries are multi-view-based detection methods.

Method      | 3D-FRONT                  | ScanNet
            | R25   R50   AP25  AP50    | R25   R50   AP25  AP50
VoteNet     | 81.5  61.6  73.0  49.6    | 78.5  34.2  66.8  18.2
GroupFree   | 84.9  63.7  72.1  45.1    | 75.2  37.6  60.1  20.4
FCAF3D      | 89.1  56.9  73.1  35.2    | 90.2  42.4  63.7  18.5
ImVoxelNet  | 88.3  71.5  86.1  66.4    | 51.7  20.2  37.3   9.8
NeRF-MAE*   | 97.2  74.5  85.3  63.0    | 92.0  39.5  57.1  17.0
NeRF-RPN    | 96.3  69.9  85.2  59.9    | 89.2  42.9  55.5  18.4
Ours        | 98.5  77.2  87.3  66.6    | 92.6  45.8  60.2  25.3

* NeRF-MAE results include pretraining on 3D-FRONT and two additional external datasets (Hypersim and HM3D).

Qualitative Visualization

3D-FRONT

Rotating 3D visualization of predicted bounding boxes on the 3D-FRONT dataset. Colored wireframe boxes indicate detected objects overlaid on the NeRF-derived volumetric scene representation.

ScanNet

Rotating 3D visualization of predicted bounding boxes on the ScanNet dataset. Our method accurately localizes objects in cluttered real-world indoor scenes.

BibTeX

@article{nerf_uavel2026,
    title     = {NeRF-UAVeL: Unified Attention-driven Volumetric Learning for Enhanced NeRF-based 3D Object Detection},
    author    = {Goshu, Hana L. and Wakjira, Tadesse G. and Atlaw, Meklit M. and Chan, Kin-Chung and Lai, Songjiang and Lam, Kin-Man},
    journal   = {Neurocomputing},
    year      = {2026},
    note      = {Under Review}
}

Acknowledgements

This work builds upon NeRF-RPN. We thank the authors for making their code and data publicly available.