Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

MetaOcc: Spatio-Temporal Fusion of Surround-View 4D Radar and Camera for 3D Occupancy Prediction with Dual Training Strategies

Created by
  • Haebom

Authors

Long Yang, Lianqing Zheng, Wenjin Ai, Minghao Liu, Sen Li, Qunshu Lin, Shengyu Yan, Jie Bai, Zhixiong Ma, Tao Huang, Xichan Zhu

Outline

This paper presents MetaOcc, a multi-modal framework for robust 3D occupancy prediction, even in adverse weather. MetaOcc performs omnidirectional 3D occupancy prediction from surround-view 4D radar and camera images. To overcome the limitations of directly applying LiDAR-based encoders to sparse radar data, the authors propose a Radar Height Self-Attention module that strengthens vertical spatial reasoning and feature extraction. A hierarchical multi-scale multi-modal fusion strategy then performs adaptive local-global fusion across modalities and time, mitigating spatio-temporal misalignment and enriching the fused feature representation. To reduce reliance on expensive point-cloud annotations, they further propose a pseudo-label generation pipeline based on an open-set segmenter, enabling a semi-supervised training strategy that reaches 90% of fully supervised performance with only 50% of the ground-truth labels. Experiments show that MetaOcc achieves state-of-the-art results, improving on prior methods by +0.47 SC IoU and +4.02 mIoU on the OmniHD-Scenes dataset and by +1.16 SC IoU and +1.24 mIoU on SurroundOcc-nuScenes.
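Since the summary describes the Radar Height Self-Attention module only at a high level, here is a minimal sketch of what height-axis self-attention over a radar voxel volume could look like. It assumes PyTorch and a (B, C, Z, H, W) feature layout; the head count, residual LayerNorm, and shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RadarHeightSelfAttention(nn.Module):
    """Self-attention along the vertical (Z) axis, letting each height
    slice aggregate evidence from sparse radar returns at other heights."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, Z, H, W) voxelized radar features
        b, c, z, h, w = x.shape
        # Fold each BEV cell into the batch; the attention sequence is height.
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, z, c)
        out, _ = self.attn(seq, seq, seq)   # attend over the Z axis
        seq = self.norm(seq + out)          # residual connection + LayerNorm
        return seq.reshape(b, h, w, z, c).permute(0, 4, 3, 1, 2)

# Usage: enrich sparse radar voxels before fusing with camera features.
feats = torch.randn(2, 64, 8, 32, 32)            # (B, C, Z, H, W)
enriched = RadarHeightSelfAttention(64)(feats)   # same shape, height-aware
```

Attending only along Z keeps the cost linear in the size of the BEV grid while still letting the sparse vertical returns exchange information.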

Takeaways, Limitations

Takeaways:
Presents an effective multi-modal fusion framework for robust 3D occupancy prediction, even in adverse weather conditions.
Proposes a Radar Height Self-Attention module for effective feature extraction from sparse radar data.
Proposes a hierarchical multi-scale multi-modal fusion strategy that mitigates spatio-temporal misalignment and enriches feature representations.
Reduces annotation costs while largely preserving performance through a pseudo-label-based semi-supervised training strategy (see the sketch after this list).
Achieves state-of-the-art performance on the OmniHD-Scenes and SurroundOcc-nuScenes datasets.
Demonstrates practical applicability to real autonomous driving systems.
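To make the semi-supervised strategy concrete, below is a minimal sketch of a mixed objective in PyTorch, assuming per-voxel cross-entropy where annotated frames contribute ground-truth labels and the remaining frames contribute pseudo-labels from an open-set segmenter. The names `gt_mask` and `pseudo_weight`, and the down-weighting scheme, are hypothetical illustrations, not the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(logits: torch.Tensor,
                         labels: torch.Tensor,
                         gt_mask: torch.Tensor,
                         pseudo_weight: float = 0.5) -> torch.Tensor:
    """logits: (N, num_classes) per-voxel class scores.
    labels: (N,) ground truth where gt_mask is True, pseudo-labels elsewhere.
    gt_mask: (N,) bool marking voxels from human-annotated frames."""
    per_voxel = F.cross_entropy(logits, labels, reduction="none")
    # Down-weight the noisier pseudo-labeled voxels (hypothetical scheme).
    weights = torch.full_like(per_voxel, pseudo_weight)
    weights[gt_mask] = 1.0
    return (weights * per_voxel).mean()

# Usage with toy shapes: 100 voxels, 17 semantic classes.
logits = torch.randn(100, 17)
labels = torch.randint(0, 17, (100,))
gt_mask = torch.rand(100) < 0.5                  # ~50% human-labeled
loss = semi_supervised_loss(logits, labels, gt_mask)
```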
Limitations:
The semi-supervised pipeline still reaches only about 90% of fully supervised performance.
Generalization across a wider range of adverse weather conditions requires further validation.
Real-time performance in actual autonomous driving environments has yet to be evaluated.