Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Vision-Language Cross-Attention for Real-Time Autonomous Driving

Created by
  • Haebom

Authors

Santosh Patapati, Trisanth Srinivasan, Murari Ambati

XYZ-Drive: A Single Vision-Language Model-Based Autonomous Driving System

Outline

XYZ-Drive is a single vision-language model for autonomous driving in complex environments that demand both geometric accuracy and semantic understanding. It takes a forward-facing camera frame, a 25 m × 25 m aerial map, and the next waypoint as input, and outputs steering and speed. A lightweight, goal-focused cross-attention layer highlights the image and map patches relevant to the waypoint tokens, and the fused tokens are fed into a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX outdoor driving benchmark, XYZ-Drive achieves a 95% success rate and an SPL (Success weighted by Path Length) of 0.80, outperforming PhysNav-DG by 15% while halving collisions.
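
To make the goal-focused cross-attention concrete, here is a minimal PyTorch sketch in which the waypoint token queries the concatenated camera and map patch tokens, and the resulting attention weights re-weight the patches before fusion. The class name, token dimensions, use of nn.MultiheadAttention, and the re-weighting scheme are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GoalFocusedCrossAttention(nn.Module):
    """The waypoint token attends over camera and map patch tokens so that
    patches relevant to the current goal are emphasized before fusion.
    (Illustrative sketch; not the paper's actual layer.)"""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, waypoint_tok, image_toks, map_toks):
        # waypoint_tok: (B, 1, D), image_toks: (B, Ni, D), map_toks: (B, Nm, D)
        patches = torch.cat([image_toks, map_toks], dim=1)            # (B, Ni+Nm, D)
        goal_ctx, attn_w = self.attn(query=waypoint_tok,
                                     key=patches, value=patches)      # ctx: (B, 1, D), weights: (B, 1, Ni+Nm)
        weighted = patches * attn_w.transpose(1, 2)                   # re-weight patches toward the goal
        # Fused sequence: goal-conditioned context token followed by re-weighted patches.
        return torch.cat([self.norm(goal_ctx), weighted], dim=1)      # (B, 1+Ni+Nm, D)


# Example shapes only: 32 image patches, 16 map patches, 1024-dim tokens.
layer = GoalFocusedCrossAttention()
fused = layer(torch.randn(2, 1, 1024),
              torch.randn(2, 32, 1024),
              torch.randn(2, 16, 1024))
print(fused.shape)  # torch.Size([2, 49, 1024])
```

In XYZ-Drive, the fused tokens would then be passed to the partially fine-tuned LLaMA-3.2 11B backbone, which produces the steering and speed outputs.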

Takeaways, Limitations

  • A single-model architecture can handle the end-to-end autonomous driving task.
  • Early fusion of the different modalities (camera view, waypoints, aerial map) improves performance.
  • The results demonstrate the importance of the goal-centered attention mechanism.
  • Fine-tuning the VLM (Vision-Language Model) is shown to be necessary (see the sketch after this list).
  • The impact of map resolution on performance is examined.
  • Using a single branch improves efficiency.
  • Limitations: not yet presented.
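
The outline notes that the LLaMA-3.2 11B backbone is only partially fine-tuned. Below is a minimal PyTorch sketch of one common way to do this: freeze all parameters, then re-enable gradients for a few named submodules. The helper name, the suffix list, and the GPT-2 stand-in checkpoint are illustrative assumptions, not the paper's training setup.

```python
# Minimal sketch of partial fine-tuning: freeze everything, then unfreeze a few blocks.
# The helper name, suffix list, and GPT-2 stand-in are assumptions for illustration;
# XYZ-Drive itself partially fine-tunes a LLaMA-3.2 11B model.
import torch
from transformers import AutoModelForCausalLM

def partially_unfreeze(model: torch.nn.Module, trainable_suffixes=("h.10", "h.11", "ln_f")):
    """Enable gradients only for parameters whose names contain one of the suffixes."""
    for name, param in model.named_parameters():
        param.requires_grad = any(s in name for s in trainable_suffixes)
    return model

# GPT-2 is used here only as a small, openly available stand-in for the real backbone.
model = partially_unfreeze(AutoModelForCausalLM.from_pretrained("gpt2"))
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} parameters")
```

Only the unfrozen fraction of parameters is updated during training, which keeps the memory and compute cost of adapting a large backbone manageable.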