Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI

Created by
  • Haebom

Author

Benjamin Raphael Ernhofer, Daniil Prokhorov, Jannica Langner, Dominik Bollmann

Outline

This paper presents a vision-language framework for understanding and interacting with automotive infotainment UIs, enabling seamless adaptation across diverse UI designs. To support this, the authors release AutomotiveUI-Bench-4K, an open-source dataset of 998 images with 4,208 annotations, and describe a data pipeline for generating training data. They fine-tune a Molmo-7B-based model with LoRA (Low-Rank Adaptation) and develop an Evaluative Large Action Model (ELAM) that integrates visual grounding with evaluation capabilities. ELAM achieves strong performance on AutomotiveUI-Bench-4K and, in particular, surpasses the baseline model by 5.6% on the ScreenSpot task (80.8% average accuracy). It matches or outperforms specialized models for desktop, mobile, and web platforms and, despite being trained primarily on the automotive domain, generalizes well beyond it. Through targeted data collection and fine-tuning, this work charts a direction for AI-driven advances in automotive UI understanding and interaction, delivering a fine-tuned model that can be deployed cost-effectively on consumer-grade GPUs.
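The paper's training code is not reproduced here, but the core idea of LoRA fine-tuning can be sketched in a few lines: the pretrained weight W is frozen, and a trainable low-rank update (alpha/r)·BA is added on top. The dimensions below are hypothetical toy values, not those of Molmo-7B.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 16, 4, 8          # toy dims: hidden size, LoRA rank, scaling factor

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection (small random init)
B = np.zeros((d, r))                    # trainable up-projection (zero init)

def lora_forward(x, W, A, B, alpha, r):
    # output = x W^T + (alpha / r) * x A^T B^T
    # Only A and B are updated during fine-tuning; W stays fixed.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((2, d))
y = lora_forward(x, W, A, B, alpha, r)
# With B initialized to zero, the adapter starts as an exact no-op:
assert np.allclose(y, x @ W.T)
```

Because only the rank-r matrices A and B receive gradients, the number of trainable parameters drops from d² to 2·d·r per adapted layer, which is what makes fine-tuning feasible on consumer-grade GPUs.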

Takeaways, Limitations

Takeaways:
Presents a vision-language framework that adapts to diverse design changes in automotive UIs.
Enables further research by releasing the open-source AutomotiveUI-Bench-4K dataset.
Demonstrates a cost-effective LoRA-based fine-tuning method and verifies deployability on consumer-grade GPUs.
Shows improved performance on the ScreenSpot task and strong domain generalization compared to existing models.
Advances AI-based understanding of, and interaction with, automotive UIs.
Limitations:
The dataset may need to be expanded; 998 images may not adequately cover the variety of real-world situations.
Possible bias toward specific automotive UI designs.
Performance has not been verified in real-world driving environments.
Support for multiple languages and consideration of cultural differences may be lacking.
Further research is needed on performance degradation and stability over long-term use.