
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding

Created by
  • Haebom

Authors

ZongHan Hsieh, Tzer-Jen Wei, ShengJing Yang

Outline

ZonUI-3B is a lightweight vision-language model (VLM) that can be fully trained on a single consumer-grade GPU (RTX 4090) while performing comparably to much larger models on GUI grounding tasks. It addresses the scarcity of high-resolution desktop training data with a cross-platform, multi-resolution dataset of 24K examples drawn from mobile, desktop, and web GUI screenshots. A two-stage fine-tuning strategy (initial cross-platform training followed by specialized fine-tuning on high-resolution data) improves the model's adaptability, and a redundancy-reduction strategy shows that data diversity matters more than sheer quantity. It achieves strong accuracy on the ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks (84.9%, 86.4%, and 86.4%, respectively), outperforming existing models with fewer than 4B parameters. Ablation studies confirm that balanced sampling and the two-stage fine-tuning are key to robustness in high-resolution desktop scenarios. The model is available at https://github.com/Han1018/ZonUI-3B .
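At the data-scheduling level, the two-stage strategy amounts to training first on the full cross-platform mix and then continuing only on high-resolution examples. The sketch below is illustrative only: the toy records, the `train_step` placeholder, and the resolution threshold are assumptions, not the authors' actual training code.

```python
import random

# Toy GUI-grounding dataset: each record pairs a screenshot with a
# normalized target click point. These are placeholders, not the
# paper's actual 24K-example corpus.
DATASET = [
    {"platform": "mobile",  "width": 1080, "path": "m1.png", "target": (0.42, 0.71)},
    {"platform": "web",     "width": 1920, "path": "w1.png", "target": (0.10, 0.25)},
    {"platform": "desktop", "width": 3840, "path": "d1.png", "target": (0.88, 0.05)},
    {"platform": "desktop", "width": 2560, "path": "d2.png", "target": (0.33, 0.60)},
]

def train_step(example):
    # Placeholder for one optimizer step on the 3B VLM; a real run would
    # encode the screenshot, predict the click point, and backpropagate.
    print(f"step on {example['platform']:7s} @ {example['width']}px")

def run_stage(examples, epochs):
    for _ in range(epochs):
        random.shuffle(examples)
        for ex in examples:
            train_step(ex)

# Stage 1: initial training on the full cross-platform mix.
run_stage(DATASET, epochs=1)

# Stage 2: specialized fine-tuning on high-resolution (desktop-like) data only.
HIGH_RES = [ex for ex in DATASET if ex["width"] >= 2560]
run_stage(HIGH_RES, epochs=1)
```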

Takeaways, Limitations

Takeaways:
Successfully develops a lightweight VLM that matches the performance of much larger models while remaining trainable on a single consumer-grade GPU.
Improves GUI understanding and adaptability through a cross-platform, multi-resolution dataset and a two-stage fine-tuning strategy.
Shows that data diversity matters more than sheer quantity, improving data efficiency through redundancy reduction (a minimal sketch follows this list).
Achieves strong accuracy on GUI grounding benchmarks (ScreenSpot 84.9%, ScreenSpot-v2 86.4%).
Improves accessibility through an open-source release.
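The redundancy-reduction and balanced-sampling ideas above boil down to dropping near-identical screenshots and capping each platform's share of the training mix. A minimal sketch, assuming exact-hash deduplication and a fixed per-platform cap (both are assumptions; the paper does not specify these exact mechanisms):

```python
import hashlib
import random
from collections import defaultdict

def dedupe(examples):
    """Drop examples whose screenshot bytes hash identically (a crude
    stand-in for whatever redundancy measure the paper actually uses)."""
    seen, kept = set(), []
    for ex in examples:
        digest = hashlib.sha256(ex["screenshot_bytes"]).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(ex)
    return kept

def balanced_sample(examples, per_platform):
    """Cap each platform's contribution so abundant mobile data cannot
    drown out the scarcer high-resolution desktop examples."""
    by_platform = defaultdict(list)
    for ex in examples:
        by_platform[ex["platform"]].append(ex)
    sample = []
    for group in by_platform.values():
        random.shuffle(group)
        sample.extend(group[:per_platform])
    return sample

# Toy corpus: mobile screenshots vastly outnumber desktop ones, and
# many mobile screenshots are byte-for-byte duplicates.
corpus = (
    [{"platform": "mobile", "screenshot_bytes": bytes([i % 7])} for i in range(20)]
    + [{"platform": "desktop", "screenshot_bytes": bytes([i])} for i in range(3)]
)
mix = balanced_sample(dedupe(corpus), per_platform=3)
print(len(mix), "examples after dedupe + balancing")
```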
Limitations:
The dataset is still relatively small (24K examples); a larger dataset might yield further performance gains.
Generalization to specific GUI types or resolutions may require additional study.
While it outperforms models with fewer than 4B parameters, comparative analysis against much larger models is limited.
Further performance evaluation in real application environments is needed.