Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI

Created by
  • Haebom

Authors

Benjamin Raphael Ernhofer, Daniil Prokhorov, Jannica Langner, Dominik Bollmann

Outline

This paper presents a vision-language framework for understanding and interacting with automotive infotainment systems that adapts to changes in UI design. The authors release AutomotiveUI-Bench-4K, an open-source dataset of 998 images with 4,208 annotations, and generate training data through a synthetic data pipeline. They fine-tune a Molmo-7B-based model with Low-Rank Adaptation (LoRA), combining reasoning generated by the pipeline with visual grounding and evaluation capabilities, to produce the Evaluative Large Action Model (ELAM). ELAM performs strongly on AutomotiveUI-Bench-4K and generalizes across domains: on ScreenSpot it reaches 80.8% average accuracy, a 5.6% improvement over the baseline model, which matches or exceeds models specialized for desktop, mobile, and web. The study shows how data collection and fine-tuning can advance AI for automotive UI understanding and interaction, and it yields a model that can be deployed cost-effectively on consumer-grade GPUs.
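To make the accuracy figure concrete: grounding benchmarks of this kind are commonly scored by checking whether the predicted click point lands inside the annotated bounding box of the target UI element. The sketch below assumes that scoring rule. The function names, the box format, and the sample data are illustrative assumptions, not taken from the paper or from AutomotiveUI-Bench-4K.

```python
from typing import List, Tuple

Point = Tuple[float, float]               # (x, y) in pixels
Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (edges count as inside)."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def grounding_accuracy(predictions: List[Point], targets: List[Box]) -> float:
    """Fraction of predicted points that fall inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return hits / len(targets)


# Hypothetical predictions and annotations, for illustration only.
predicted_points = [(120.0, 45.0), (300.0, 410.0), (15.0, 15.0), (640.0, 200.0)]
target_boxes = [
    (100.0, 30.0, 180.0, 60.0),
    (280.0, 390.0, 340.0, 430.0),
    (50.0, 50.0, 90.0, 90.0),
    (600.0, 180.0, 700.0, 230.0),
]

accuracy = grounding_accuracy(predicted_points, target_boxes)
print(f"accuracy = {accuracy:.2%}")  # accuracy = 75.00%
```

Three of the four hypothetical points fall inside their boxes, so the sketch reports 75%. An average accuracy such as the 80.8% quoted above would be the mean of per-split scores computed in this manner, assuming the same rule applies.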

Takeaways, Limitations

Takeaways:
Presenting an efficient vision-language framework for UI understanding and interaction in automotive infotainment systems
Enabling research through the release of the open source dataset AutomotiveUI-Bench-4K
Cost-effective model development and consumer-grade GPU deployment potential via LoRA-based fine-tuning
Strong generalization performance across a wide range of UI designs (80.8% average accuracy on ScreenSpot)
Limitations:
Limited dataset size (998 images)
Lack of performance validation in real driving environments
Dependency on a specific model (Molmo-7B)
Need for further research on the generalizability and limitations of the synthetic data pipeline
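The cost argument behind LoRA-based fine-tuning can be illustrated with a parameter count. LoRA keeps a weight matrix of shape d_out x d_in frozen and trains only two low-rank factors, B (d_out x r) and A (r x d_in). The layer size and rank below are assumed for illustration and are not the configuration used in the paper.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Number of weights in a dense d_out x d_in matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Number of trainable weights in the LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Assumed values for illustration: one 4096 x 4096 projection, rank 16.
d_out, d_in, rank = 4096, 4096, 16

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
ratio = lora / full

print(f"full fine-tuning: {full:,} trainable weights")   # 16,777,216
print(f"LoRA (r={rank}):    {lora:,} trainable weights")  # 131,072
print(f"LoRA / full:      {ratio:.2%}")                   # 0.78%
```

Under these assumptions the adapter trains under 1% of the weights of the layer it modifies, which is why this style of fine-tuning fits on modest hardware.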