This page curates AI-related papers published worldwide. All content is summarized using Google Gemini, and the page is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding
Created by
Haebom
Author
ZongHan Hsieh, Tzer-Jen Wei, ShengJing Yang
Outline
ZonUI-3B is a lightweight vision-language model (VLM) that can be fully trained on a single consumer-grade GPU (RTX 4090) yet performs comparably to much larger models on GUI grounding tasks. It addresses the data shortage in high-resolution desktop environments with a cross-platform, multi-resolution dataset of 24K examples drawn from mobile, desktop, and web GUI screenshots. A two-stage fine-tuning strategy, consisting of initial cross-platform training followed by specialized fine-tuning on high-resolution data, improves the model's adaptability, and a redundancy-reduction strategy shows that data diversity matters more than sheer quantity. The model achieves strong accuracy on the ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks (84.9% on ScreenSpot, 86.4% on ScreenSpot-v2, and 86.4% on ScreenSpot-Pro), outperforming existing models with fewer than 4B parameters. Ablation studies confirm that balanced sampling and the two-stage fine-tuning are key to robustness in high-resolution desktop scenarios. The model is available at https://github.com/Han1018/ZonUI-3B .
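The balanced sampling and two-stage fine-tuning described above can be pictured with a minimal Python sketch. This is an illustrative outline only, not the authors' implementation: the record layout, the per-platform cap, and the `finetune` placeholder are assumptions made for the example.

```python
import random
from collections import defaultdict

# Assumed record format: each example notes the platform it came from
# ("mobile", "desktop", or "web") plus a screenshot/instruction/target triple.
# This is a hypothetical schema for illustration.

def balanced_sample(examples, per_platform):
    """Redundancy-reduction style sampling: cap each platform's contribution
    so that data diversity, rather than raw quantity, drives the mix
    (illustrative, not the paper's exact procedure)."""
    by_platform = defaultdict(list)
    for ex in examples:
        by_platform[ex["platform"]].append(ex)
    sampled = []
    for pool in by_platform.values():
        random.shuffle(pool)
        sampled.extend(pool[:per_platform])
    random.shuffle(sampled)
    return sampled

def finetune(model, data, stage):
    """Placeholder for one supervised fine-tuning pass of the 3B VLM
    (in practice, a training loop that fits on a single RTX 4090)."""
    print(f"[stage {stage}] fine-tuning on {len(data)} examples")
    return model

def two_stage_training(model, all_examples, high_res_desktop_examples):
    # Stage 1: cross-platform training on the balanced multi-resolution mix.
    stage1_data = balanced_sample(all_examples, per_platform=8000)
    model = finetune(model, stage1_data, stage=1)
    # Stage 2: specialized fine-tuning on high-resolution desktop data.
    model = finetune(model, high_res_desktop_examples, stage=2)
    return model
```

The sketch only conveys the ordering of the two stages and the per-platform cap; dataset sizes and training hyperparameters are given in the paper and repository.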