Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection

Created by
  • Haebom

Authors

Xiang Li, Zhangchi Hu, Xiao Xu, Bin Kong

Outline

This paper presents a method that fuses LiDAR and camera inputs into a unified Bird's-Eye-View (BEV) representation to improve the 3D perception of autonomous vehicles. Existing methods suffer from spatial misalignment between LiDAR and camera features, which corrupts the depth supervision of the camera branch and degrades cross-modal feature aggregation. The paper argues that the root causes of this misalignment are calibration inaccuracies and projection errors induced by the rolling shutter effect, and observes that these errors are predictably concentrated at object-background boundaries, which 2D detectors identify reliably. The core idea is therefore to leverage 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, the authors propose Prior-Guided Depth Calibration (PGDC), which uses 2D priors to correct misalignment and preserve accurate cross-modal feature pairs. To address global alignment errors, they introduce Discontinuity-Aware Geometric Fusion (DAGF), which suppresses residual noise from PGDC and explicitly sharpens depth discontinuities at object-background boundaries to produce structurally aware representations. To exploit the aligned representations, they integrate a Structural Guidance Depth Modulator (SGDM), which efficiently fuses aligned depth and image features using a gated attention mechanism. The proposed method achieves state-of-the-art performance (71.5% mAP, 73.6% NDS) on the nuScenes validation dataset.
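The summary describes SGDM as fusing aligned depth and image features through a gated attention mechanism. As a rough illustration of that idea only, and not the authors' implementation, the sketch below shows a minimal gated fusion block in PyTorch; the module name, layer choices, and shapes are assumptions for illustration.

```python
# Minimal sketch (hypothetical, not the paper's code): gated fusion of
# spatially aligned depth features and image features.
import torch
import torch.nn as nn

class GatedDepthImageFusion(nn.Module):
    """SGDM-style idea: a learned gate decides, per channel and per location,
    how much depth-derived information to blend into the image features."""
    def __init__(self, channels: int):
        super().__init__()
        # Gate computed from the concatenated modalities, values in (0, 1).
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.out_proj = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, img_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # img_feat, depth_feat: (B, C, H, W), assumed to be pre-aligned (e.g., by PGDC/DAGF).
        g = self.gate(torch.cat([img_feat, depth_feat], dim=1))
        fused = g * depth_feat + (1.0 - g) * img_feat  # gated blend of the two modalities
        return self.out_proj(fused)

if __name__ == "__main__":
    fusion = GatedDepthImageFusion(channels=64)
    img = torch.randn(2, 64, 32, 88)   # toy image features
    dep = torch.randn(2, 64, 32, 88)   # toy aligned depth features
    print(fusion(img, dep).shape)      # torch.Size([2, 64, 32, 88])
```

The gate here simply weights the two modalities before a final projection; the paper's actual SGDM may differ in structure and detail.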

Takeaways, Limitations

Takeaways:
  • Presents an effective solution to the spatial misalignment problem that arises when fusing LiDAR and camera data.
  • Improves the accuracy of cross-modal feature alignment by leveraging 2D object priors.
  • Improves the structural awareness and accuracy of the BEV representation through the PGDC, DAGF, and SGDM modules.
  • Achieves SOTA performance on the nuScenes dataset.
Limitations:
  • Performance has been validated only on a single dataset (nuScenes), so generalization to other datasets is unverified.
  • The method depends on the 2D object detector, so detector errors can propagate to the overall system.
  • Further verification of generalization to real-world autonomous driving environments is needed.
  • Further study of computational complexity and real-time processing capability is needed.