Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding

Created by
  • Haebom

Author

Wanfu Wang, Qipeng Huang, Guangquan Xue, Xiaobo Liang, Juntao Li

Outline

This paper proposes the LASER framework to address a key challenge in GUI grounding tasks: effective image-region reasoning for Vision Language Models (VLMs) under high-resolution inputs and complex multi-element visual interactions. LASER integrates Monte Carlo quality estimation with IoU-based region quality assessment to progressively equip VLMs with multi-level perceptual capabilities that improve both accuracy and diversity, enabling precise coordinate prediction. This allows the model to focus on the regions most relevant to the instruction and to adaptively allocate reasoning steps according to task complexity. Experimental results on the ScreenSpot-Pro and ScreenSpot-v2 benchmarks demonstrate the effectiveness of LASER, which achieves state-of-the-art performance among 7B-scale models. In particular, LASER fine-tuned from GTA1-7B achieved a score of 55.7 on ScreenSpot-Pro.
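The IoU-based region quality assessment mentioned above can be illustrated with a minimal sketch. The paper's actual scoring pipeline is not specified here, so this is only the standard intersection-over-union computation between a predicted region and a ground-truth element box; the function name and box format `(x1, y1, x2, y2)` are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2).

    Illustrative sketch only; the paper's exact region-scoring details may differ.
    """
    # Coordinates of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Example: a predicted GUI region partially overlapping the ground-truth box.
score = iou((0, 0, 10, 10), (5, 5, 15, 15))  # 25 / 175 ≈ 0.143
```

A higher IoU indicates a predicted region that more tightly covers the target UI element, which is what makes it a natural quality signal for preference optimization over candidate regions.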

Takeaways, Limitations

Takeaways:
LASER offers an effective framework for improving the multi-level perceptual capabilities of VLMs.
Combining Monte Carlo quality estimation with IoU-based region evaluation improves both accuracy and diversity.
GUI grounding performance improves under high-resolution inputs and complex visual interactions.
LASER sets a new state of the art among 7B-scale models.
Limitations:
LASER's performance improvements may be limited to the specific benchmarks evaluated (ScreenSpot-Pro, ScreenSpot-v2).
Further validation of generalization across different types of GUIs and tasks is needed.
Analysis of computational cost and efficiency is lacking.