This paper presents a vision-language framework that adapts to UI design changes in automotive infotainment systems. We release an open-source dataset, AutomotiveUI-Bench-4K, consisting of 998 images with 4,208 annotations, and generate training data through a synthetic data pipeline. We fine-tune a Molmo-7B-based model with LoRA, producing an Evaluative Large Action Model (ELAM) that integrates the reasoning, visual grounding, and evaluation capabilities derived from the pipeline-generated data. ELAM performs strongly on AutomotiveUI-Bench-4K and, notably, reaches an average accuracy of 80.8% on ScreenSpot, a +5.6% improvement over the baseline model, matching or exceeding models specialized for desktop, mobile, and web. This study charts a direction for AI development in automotive UI understanding and interaction through targeted data collection and fine-tuning, and provides a model that can be deployed cost-effectively on consumer-grade GPUs.