Daily Arxiv

This page organizes artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving

Created by
  • Haebom

Author

Sheng Yang, Tong Zhan, Guancheng Chen, Yanfeng Lu, Jian Wang

Max-V1: End-to-End Autonomous Driving via Language-Based Trajectory Prediction

Outline

This study recasts autonomous driving as a generalized language task, formalizing trajectory planning as next-waypoint prediction. Max-V1 is a novel framework for single-pass end-to-end autonomous driving: it adopts a single-pass generation paradigm that matches the inherently sequential nature of driving, leveraging the generative capacity of a Vision-Language Model (VLM) to predict trajectories end-to-end directly from front-facing camera input. The approach is underpinned by a principled supervision strategy derived from statistical modeling, which provides a well-defined learning objective and makes the model well suited to mastering complex driving policies through imitation learning on large-scale expert demonstrations. Empirically, the method achieves state-of-the-art performance on the nuScenes dataset, improving on prior baselines by over 30% overall. It also generalizes well to cross-domain datasets collected from different vehicles, indicating strong potential for cross-vehicle robustness and adaptability. These empirical strengths pave the way for more robust autonomous driving agents built on models that capture fundamental driving behavior. The code will be released with the publication.
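
The core idea, formalizing trajectory planning as next-waypoint prediction with a VLM trained by imitation, can be made concrete with a short sketch. This is not the authors' code: the waypoint tokenizer, the toy transformer decoder standing in for the VLM, and all dimensions, ranges, and bin counts are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's implementation): treat a driving
# trajectory as a token sequence and train a decoder to predict the next
# waypoint token, exactly as a language model predicts the next word.
import torch
import torch.nn as nn


class WaypointTokenizer:
    """Discretize continuous (x, y) waypoints into integer bins so an
    autoregressive model can emit a trajectory as a token sequence.
    Bin count and coordinate range are illustrative assumptions."""

    def __init__(self, num_bins: int = 256, lo: float = -50.0, hi: float = 50.0):
        self.num_bins, self.lo, self.hi = num_bins, lo, hi

    def encode(self, xy: torch.Tensor) -> torch.Tensor:
        # xy: (..., 2) coordinates in meters -> (..., 2) integer bin indices
        scaled = (xy - self.lo) / (self.hi - self.lo)
        return (scaled.clamp(0, 1) * (self.num_bins - 1)).long()

    def decode(self, tok: torch.Tensor) -> torch.Tensor:
        # Map bin indices back to approximate metric coordinates.
        return tok.float() / (self.num_bins - 1) * (self.hi - self.lo) + self.lo


class ToyDriver(nn.Module):
    """Stand-in for a VLM: cross-attends waypoint tokens over camera
    features and predicts the next waypoint token."""

    def __init__(self, num_bins: int = 256, dim: int = 128):
        super().__init__()
        self.tok_emb = nn.Embedding(num_bins, dim)
        self.img_proj = nn.Linear(512, dim)  # 512-dim camera features, assumed
        layer = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_bins)

    def forward(self, img_feat: torch.Tensor, wp_tokens: torch.Tensor):
        # img_feat: (B, N, 512) visual tokens; wp_tokens: (B, T) past tokens
        mem = self.img_proj(img_feat)
        tgt = self.tok_emb(wp_tokens)
        T = tgt.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(tgt, mem, tgt_mask=causal)
        return self.head(h)  # (B, T, num_bins) logits over the next token


if __name__ == "__main__":
    tok = WaypointTokenizer()
    model = ToyDriver()
    img = torch.randn(2, 16, 512)             # fake camera features
    expert_xy = torch.rand(2, 6, 2) * 20      # fake 6-waypoint expert path
    tokens = tok.encode(expert_xy).flatten(1) # (2, 12) interleaved x/y tokens
    logits = model(img, tokens[:, :-1])       # teacher-forced next-token logits
    # Imitation learning signal: cross-entropy against the expert's tokens.
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, 256), tokens[:, 1:].reshape(-1)
    )
    print("imitation loss:", loss.item())
```

At inference, such a model would decode waypoint tokens autoregressively from the camera features alone and map them back to coordinates with decode(); how Max-V1 actually tokenizes, supervises, and decodes trajectories is specified in the paper itself.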

Takeaways, Limitations

  • Frames autonomous driving as a single-pass generation problem.
  • Presents an end-to-end framework that predicts trajectories directly from front-facing camera input using a VLM.
  • Achieves state-of-the-art performance on the nuScenes dataset, improving on existing methods by over 30%.
  • Shows strong generalization on cross-domain datasets, demonstrating cross-vehicle robustness and adaptability.
  • Lays a foundation for further model development (code to be released).
  • The paper does not explicitly discuss its Limitations.