Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning

Created by
  • Haebom

Authors

Hang Wu, Hongkai Chen, Yujun Cai, Chang Liu, Qingwen Ye, Ming-Hsuan Yang, Yiwei Wang

Outline

This paper addresses the problem of grounding natural language queries in graphical user interfaces (GUIs), which is difficult because of the diversity of visual elements, spatial clutter, and linguistic ambiguity. We present DiMo-GUI, a training-free framework for GUI grounding that relies on two core strategies: dynamic visual grounding and modality-aware optimization. Instead of processing the GUI as a single monolithic image, the input is split into textual and iconic elements, and a general-purpose vision-language model reasons over each modality independently. When a prediction is ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focus regions centered on the model's initial prediction and progressively zooming into subregions to refine the grounding result. This hierarchical refinement process helps resolve ambiguity in visually cluttered layouts without additional training or annotation. We evaluate our approach on standard GUI grounding benchmarks and demonstrate consistent improvements over baseline inference pipelines, highlighting the effectiveness of combining modality separation with region-centric reasoning.
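The region-refinement loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the `predict` callable stands in for a vision-language model call and is hypothetical, as are the fixed shrink factor `scale` and the fixed number of `steps`. The summary does not say how DiMo-GUI chooses region sizes or when it stops, and the split into text and icon modalities is left out.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float]              # (x, y) in full-image pixels


def focus_region(center: Point, bounds: Box, scale: float) -> Box:
    """Return a box `scale` times the size of `bounds`, centered on `center`
    and shifted where necessary so that it stays inside `bounds`."""
    left, top, right, bottom = bounds
    w = (right - left) * scale
    h = (bottom - top) * scale
    # Clamp the top-left corner so the region never leaves the parent box.
    x0 = min(max(center[0] - w / 2, left), right - w)
    y0 = min(max(center[1] - h / 2, top), bottom - h)
    return (x0, y0, x0 + w, y0 + h)


def refine(predict: Callable[[Box], Point],
           image_size: Tuple[int, int],
           steps: int = 3,
           scale: float = 0.5) -> Point:
    """Zoom in around the current prediction for a fixed number of steps.

    `predict` is a hypothetical stand-in for the model: it receives the
    region to look at and returns a point in full-image coordinates.
    """
    region: Box = (0.0, 0.0, float(image_size[0]), float(image_size[1]))
    point = predict(region)                 # initial prediction on the full GUI
    for _ in range(steps):
        region = focus_region(point, region, scale)
        point = predict(region)             # re-predict inside the smaller region
    return point
```

In practice `predict` would crop the screenshot to the given region, query the model, and map the answer back to full-image coordinates. Clamping the region to its parent keeps every crop a valid sub-image even when the prediction lies near an edge of the screen.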

Takeaways, Limitations

Takeaways:
We present a GUI grounding framework that requires no training, reducing the cost of data collection and annotation.
By combining modality separation with region-centric reasoning, it can ground natural language queries effectively even in visually cluttered GUIs.
Experiments on standard GUI grounding benchmarks show consistent improvements over baseline inference pipelines.
Limitations:
Further research is needed to establish how well the proposed method generalizes. Its robustness across varied GUI designs and levels of complexity has yet to be verified.
Performance may degrade for certain types of GUIs or queries. More extensive experimentation is needed to characterize these cases.
Further analysis is needed of the method's accuracy and computational efficiency on complex GUIs and ambiguous queries.