MetaOcc: Spatio-Temporal Fusion of Surround-View 4D Radar and Camera for 3D Occupancy Prediction with Dual Training Strategies
Created by
Haebom
Author
Long Yang, Lianqing Zheng, Wenjin Ai, Minghao Liu, Sen Li, Qunshu Lin, Shengyu Yan, Jie Bai, Zhixiong Ma, Tao Huang, Xichan Zhu
Outline
This paper presents MetaOcc, a multi-modal framework for robust 3D occupancy prediction, even in adverse weather conditions. MetaOcc performs omnidirectional 3D occupancy prediction from surround-view 4D radar and camera images. To overcome the limitations of directly applying LiDAR-based encoders to sparse radar data, the authors propose a Radar Height Self-Attention module that strengthens vertical spatial reasoning and feature extraction. In addition, a hierarchical multi-scale multi-modal fusion strategy performs adaptive local-global fusion across modalities and time, mitigating spatio-temporal misalignment and enriching the fused feature representation. To reduce reliance on expensive point cloud annotations, a pseudo-label generation pipeline based on an open-set segmenter enables a semi-supervised learning strategy that reaches 90% of fully supervised performance using only 50% of the ground-truth labels. Experimental results show that MetaOcc achieves state-of-the-art performance, surpassing previous methods by +0.47 SC IoU and +4.02 mIoU on the OmniHD-Scenes dataset and by +1.16 SC IoU and +1.24 mIoU on SurroundOcc-nuScenes.
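To make the Radar Height Self-Attention idea more concrete, below is a minimal sketch of what self-attention along the vertical axis of a radar voxel grid could look like. The paper does not publish this code, so the class name, parameter names, and tensor shapes here are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch (not the authors' code): self-attention applied along the
# vertical (Z) axis of a radar voxel feature volume, so sparse radar returns
# can exchange information across heights within each BEV cell.
import torch
import torch.nn as nn


class RadarHeightSelfAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, voxel_feats: torch.Tensor) -> torch.Tensor:
        # voxel_feats: (B, C, Z, H, W) radar voxel features (assumed layout).
        B, C, Z, H, W = voxel_feats.shape
        # Treat each BEV cell as an independent sequence over the height axis.
        x = voxel_feats.permute(0, 3, 4, 2, 1).reshape(B * H * W, Z, C)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm(x + attn_out)  # residual connection + layer norm
        return x.reshape(B, H, W, Z, C).permute(0, 4, 3, 1, 2)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 8, 32, 32)      # toy radar voxel grid
    out = RadarHeightSelfAttention(64)(feats)
    print(out.shape)                            # torch.Size([2, 64, 8, 32, 32])
```

Attending only along the height dimension keeps the cost linear in the number of BEV cells, which is one plausible way to inject vertical context into sparse radar features before fusion with camera features.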
Takeaways, Limitations
•
Takeaways:
◦
Presents an effective multi-modal fusion framework for robust 3D occupancy prediction, even in adverse weather conditions.
◦
Proposes a Radar Height Self-Attention module for effective feature extraction from sparse radar data.
◦
Introduces a hierarchical multi-scale multi-modal fusion strategy that mitigates spatio-temporal misalignment and enriches the fused feature representation.
◦
Reduces annotation costs through a pseudo-label-based semi-supervised learning strategy while retaining most of the fully supervised performance.
◦
Achieves state-of-the-art performance on the OmniHD-Scenes and SurroundOcc-nuScenes datasets.
◦
Shows practical applicability to real autonomous driving systems.
•
Limitations:
◦
The pseudo-label-based semi-supervised strategy still falls somewhat short of fully supervised learning, reaching only about 90% of its performance.
◦
Generalization across a broader range of adverse weather conditions requires further validation.
◦
Real-time performance in actual autonomous driving environments has yet to be evaluated.