Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models

Created by
  • Haebom

Authors

Sung-Yeon Park, Can Cui, Yunsheng Ma, Ahmadreza Moradipari, Rohit Gupta, Kyungtae Han, Ziran Wang

Outline

This paper presents NuPlanQA-Eval, a benchmark for evaluating how well multimodal large language models (MLLMs) understand driving scenes, together with the large-scale dataset NuPlanQA-1M. NuPlanQA-1M consists of 1 million real-world visual question-answering (VQA) pairs, categorized into nine subtasks across three core skills: road environment recognition, spatial relationship recognition, and egocentric reasoning. The evaluation shows that existing MLLMs struggle with driving scene-specific recognition and with spatial reasoning from an egocentric perspective. The authors also propose BEV-LLM, which integrates bird's-eye view (BEV) features from multi-view images into an MLLM. BEV-LLM outperforms other models in six of the nine subtasks, indicating that incorporating BEV features improves the performance of multi-view MLLMs. The NuPlanQA dataset is publicly available.
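To make the benchmark structure concrete, here is a minimal sketch of how per-skill accuracy and a "wins on k of n subtasks" count could be computed. It is not the authors' evaluation code. It assumes answers are scored by exact match, and the record fields (`skill`, `subtask`, `prediction`, `answer`), the function names, and the subtask labels are all hypothetical; the summary does not name the nine subtasks.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy}, grouping records by `key` (e.g. "skill" or "subtask").

    Assumption: a prediction counts as correct only on exact match with the answer.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        correct[group] += int(record["prediction"] == record["answer"])
    return {group: correct[group] / total[group] for group in total}


def count_subtask_wins(scores_by_model, model):
    """Count the subtasks on which `model` strictly beats every other model.

    `scores_by_model` maps model name -> {subtask: accuracy}.
    """
    own_scores = scores_by_model[model]
    wins = 0
    for subtask, score in own_scores.items():
        others = [
            scores[subtask]
            for name, scores in scores_by_model.items()
            if name != model
        ]
        if all(score > other for other in others):
            wins += 1
    return wins


# Toy records with made-up values, for illustration only (not results from the paper).
records = [
    {"skill": "road environment recognition", "subtask": "subtask_a",
     "prediction": "A", "answer": "A"},
    {"skill": "road environment recognition", "subtask": "subtask_a",
     "prediction": "B", "answer": "C"},
    {"skill": "egocentric reasoning", "subtask": "subtask_b",
     "prediction": "D", "answer": "D"},
]
per_skill = accuracy_by_group(records, "skill")
```

Passing `"subtask"` instead of `"skill"` as the key gives the finer-grained breakdown, and `count_subtask_wins` applied to such per-subtask scores yields the kind of "six of nine subtasks" figure quoted above.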

Takeaways, Limitations

Takeaways:
The paper introduces a new benchmark (NuPlanQA-Eval) and a large-scale dataset (NuPlanQA-1M) for multi-view, multi-modal driving scene understanding.
It shows that integrating BEV features can improve the driving scene understanding of MLLMs (BEV-LLM).
It clarifies the limitations of existing MLLMs in driving scene recognition and spatial reasoning.
The public release of the dataset is expected to stimulate further research.
Limitations:
The performance improvement of the proposed BEV-LLM may be limited to certain datasets.
Further research is needed to generalize performance across a variety of driving environments and situations.
BEV-LLM did not outperform other models in three of the nine subtasks, which leaves room for improvement.