Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Vision-Language Fusion for Real-Time Autonomous Driving: Goal-Centered Cross-Attention of Camera, HD-Map, & Waypoints

Created by
  • Haebom

Author

Santosh Patapati, Trisanth Srinivasan, Murari Ambati

Outline

XYZ-Drive is an autonomous driving system built on a single vision-language model: it takes a forward-facing camera frame, a 25 m x 25 m aerial map, and the next waypoint as input, and outputs steering and speed. A lightweight, goal-centered cross-attention layer lets the waypoint tokens highlight the relevant image and map patches, supporting both action prediction and textual explanations; the fused tokens are then fed into a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark, XYZ-Drive achieves a 95% success rate and a 0.80 success weighted by path length (SPL), a 15% improvement over PhysNav-DG, while halving the number of collisions and markedly improving efficiency thanks to its single-branch design. Sixteen ablation studies substantiate these performance gains.
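The goal-centered cross-attention described above can be sketched with plain scaled dot-product attention: waypoint tokens act as queries, and the camera/map patch tokens act as keys and values, so the softmax weights indicate which patches are relevant to the next waypoint. This is a minimal illustration under assumed toy dimensions, not the paper's implementation; all names and values here are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Goal-centered cross-attention sketch: each waypoint (query) token
    scores every image/map patch (key), and the softmax weights pick out
    the patches most relevant to that waypoint before pooling the values."""
    d_k = len(keys[0])
    fused, attn_maps = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)           # one relevance weight per patch
        attn_maps.append(w)
        # Weighted sum of value vectors -> one fused token per query.
        fused.append([sum(wi * vi for wi, vi in zip(w, col))
                      for col in zip(*values)])
    return fused, attn_maps

# Toy example: 2 waypoint tokens attending over 3 patch tokens.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys    = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values  = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fused, attn = cross_attention(queries, keys, values)
```

In the actual model the query/key/value projections are learned and the fused tokens are passed on to the language model; the sketch only shows why the attention map itself makes the patch selection interpretable.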

Takeaways, Limitations

Takeaways:
We demonstrate that early token-level fusion of vision, waypoints, and map information enables accurate, transparent, and real-time autonomous driving.
We demonstrate that a single vision-language model can simultaneously improve the accuracy and efficiency of autonomous driving.
We demonstrate that goal-driven attention mechanisms play a crucial role in effectively integrating the different input modalities.
It highlights the importance of fine-tuning when applying VLMs to specific tasks such as autonomous driving.
Limitations:
As map resolution decreases (from 10 cm to 40 cm), lane edges become blurry and collision rates increase, suggesting the need for higher-resolution maps.
Removing any one modality (vision, waypoints, or map) reduces the success rate by up to 11%, showing that the system depends heavily on the complementary roles of the modalities; robustness to missing or degraded modalities needs improvement.
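For reference, the success rate and SPL figures cited above come from standard embodied-navigation metrics. SPL (success weighted by path length) can be computed as a minimal sketch; the episode data below is purely illustrative.

```python
def spl(episodes):
    """Success weighted by Path Length:
    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i),
    where S_i is the success flag (0 or 1), l_i the shortest-path length,
    and p_i the length of the path the agent actually took."""
    n = len(episodes)
    return sum(s * (l / max(p, l)) for s, l, p in episodes) / n

# Toy episodes: (success, shortest_path_m, actual_path_m)
eps = [(1, 10.0, 12.5),   # success, but 25% longer than optimal
       (1, 20.0, 20.0),   # success along the optimal path
       (0, 15.0, 30.0)]   # failure contributes zero
print(round(spl(eps), 3))  # -> 0.6
```

An SPL of 0.80 at a 95% success rate thus means successful runs stay reasonably close to the shortest feasible path.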