Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
The summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Goal-Based Vision-Language Driving

Created by
  • Haebom

Author

Santosh Patapati, Trisanth Srinivasan

Outline

NovaDrive is a single-branch vision-language architecture for autonomous driving in complex situations that processes front-camera images, HD-map tiles, LiDAR depth, and text-based waypoints in one branch. A lightweight two-stage cross-attention block first aligns waypoint tokens with the HD map, then refines attention over fine-grained image and depth patches. Combined with a novel smoothing loss that discourages abrupt steering and velocity changes, this eliminates the need for recurrent memory. The top 15 layers of an 11B LLaMA-3.2 vision-language backbone are fine-tuned to enable real-time inference. On the nuScenes/Waymo subset of the MD-NEX Outdoor benchmark, NovaDrive raises the success rate to 84% (+4%), improves path efficiency (SPL) to 0.66 (+0.11), and cuts collision frequency from 2.6% to 1.2% (-1.4%). These gains are attributed primarily to the waypoint tokens, partial VLM fine-tuning, and cross-attention fusion.
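The two ideas above can be sketched in code. The following is a minimal, hypothetical NumPy illustration (not the paper's implementation): a two-stage cross-attention fusion in which waypoint tokens first attend to HD-map tokens and the result then attends to image/depth patches, plus a smoothing loss that penalizes abrupt step-to-step changes in predicted steering and velocity. All function names, shapes, and the squared-first-difference form of the loss are assumptions for illustration.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (illustrative sketch)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)                 # (Q, K) similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ values                                # (Q, d) fused output

def two_stage_fusion(waypoint_tokens, map_tokens, image_depth_patches):
    """Stage 1: align waypoint tokens to HD-map tokens.
    Stage 2: refine the aligned tokens against fine-grained image/depth patches.
    """
    aligned = cross_attention(waypoint_tokens, map_tokens, map_tokens)
    refined = cross_attention(aligned, image_depth_patches, image_depth_patches)
    return refined

def smoothness_loss(actions, weight=1.0):
    """Penalize abrupt changes between consecutive predicted actions.

    actions: (T, 2) array of [steering, velocity] over T future steps.
    Squared first-order differences punish large step-to-step jumps;
    the exact form of the paper's smoothing loss may differ.
    """
    diffs = np.diff(actions, axis=0)                       # (T-1, 2) deltas
    return weight * float(np.mean(diffs ** 2))
```

A constant action sequence incurs zero smoothness penalty, while oscillating steering commands are penalized, which is the behavior the paper's loss is described as encouraging.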

Takeaways, Limitations

Takeaways:
Improved autonomous driving safety and efficiency (higher success rate, better route efficiency, fewer collisions).
Reduced fuel or battery usage thanks to shorter routes.
Potential for a lighter, more easily updatable driving stack.
Potential for extension to other embodied-AI domains.
Limitations:
Specific limitations cannot be determined from the information provided; reading the full paper would be required.