Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

EMMA: End-to-End Multimodal Model for Autonomous Driving

Created by
  • Haebom

Authors

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, Yin Zhou, James Guo, Dragomir Anguelov, Mingxing Tan

Outline

EMMA is an end-to-end multimodal model for autonomous driving built on a multimodal large language model such as Gemini. It maps raw camera sensor data directly to driving-related outputs, including planner trajectories, detected objects, and road graph elements. To make the most of the world knowledge in the pre-trained large language model, EMMA represents all non-sensor inputs (such as navigation instructions and ego vehicle status) and all outputs (such as trajectories and 3D locations) as natural language text. This lets EMMA process multiple driving tasks jointly in a unified language space and generate the output for each task from a task-specific prompt. Empirically, EMMA achieves state-of-the-art or competitive motion planning results on nuScenes and on the Waymo Open Motion Dataset (WOMD), along with competitive camera-based 3D object detection results on the Waymo Open Dataset (WOD). Co-training EMMA on planner trajectories, object detection, and road graph tasks improves performance in all three domains, which highlights its potential as a generalist model for autonomous driving applications.
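The idea of expressing non-sensor inputs and outputs as text, with one prompt per task, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the prompt wording, the `TASK_PROMPTS` table, the function names, and the `(x, y)` text format are hypothetical and are not taken from the paper, and no model is actually called.

```python
import re
from typing import Dict, List, Tuple

Waypoint = Tuple[float, float]  # (x, y) in metres, ego-centric frame (assumed)

# Hypothetical task-specific prompts; the wording is illustrative only.
TASK_PROMPTS: Dict[str, str] = {
    "planning": "Predict the future trajectory of the ego vehicle.",
    "detection": "List the 3D positions of the objects in the scene.",
    "road_graph": "Describe the road graph elements in the scene.",
}


def waypoints_to_text(waypoints: List[Waypoint], precision: int = 2) -> str:
    """Serialize waypoints as plain text, e.g. "(1.00, 2.50) (3.00, 4.00)"."""
    return " ".join(f"({x:.{precision}f}, {y:.{precision}f})" for x, y in waypoints)


def text_to_waypoints(text: str) -> List[Waypoint]:
    """Parse text in the format produced by waypoints_to_text back into floats."""
    number = r"-?\d+(?:\.\d+)?"
    pairs = re.findall(rf"\(\s*({number})\s*,\s*({number})\s*\)", text)
    return [(float(x), float(y)) for x, y in pairs]


def build_prompt(task: str, command: str, speed_mps: float,
                 history: List[Waypoint]) -> str:
    """Combine the non-sensor inputs and the task instruction into one text prompt.

    Camera images would be passed to the multimodal model separately; this
    sketch covers only the text side.
    """
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task}")
    return "\n".join([
        f"Navigation command: {command}",
        f"Ego speed: {speed_mps:.1f} m/s",
        f"Past waypoints: {waypoints_to_text(history)}",
        TASK_PROMPTS[task],
    ])


if __name__ == "__main__":
    history = [(0.0, 0.0), (1.2, 0.1)]
    print(build_prompt("planning", "turn left", 5.0, history))
    # A model reply in the same text format can be decoded back into numbers.
    print(text_to_waypoints("(2.40, 0.35) (3.60, 0.80)"))
```

Because every task shares the same text interface, switching tasks in this sketch only means switching the final instruction line, which mirrors the unified language space described above.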

Takeaways, Limitations

Takeaways:
Proposes a novel architecture, built on a multimodal large language model, that handles multiple autonomous driving tasks in a unified way.
Achieves state-of-the-art or competitive performance on nuScenes and WOMD.
Shows that joint training across multiple tasks improves performance on all of them.
Suggests a new research direction for autonomous driving model architectures.
Limitations:
No specific limitations are cited from the paper.
Generalization performance in real road environments needs further validation.
Energy efficiency and real-time processing performance are not evaluated.