
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Visual Grounding Methods for Efficient Interaction with Desktop Graphical User Interfaces

Created by
  • Haebom

Authors

El Hassane Ettifouri, Jessica Lopez Espejel, Laura Minkova, Tassnim Dardouri, Walid Dahhane

Outline

This paper proposes Instruction Visual Grounding (IVG), an approach to visual grounding in synthetic images such as graphical user interfaces (GUIs). Whereas previous visual grounding studies mainly target realistic images, IVG takes a natural language instruction and a GUI screen as input and predicts the coordinates of the screen element on which the instruction should be carried out. The authors propose two methods: IVGocr, which combines an LLM, an object detection model, and an OCR module, and IVGdirect, an end-to-end approach built on a multimodal architecture. They release a dedicated dataset for each method. They also propose CPV, a new evaluation metric that relaxes the existing CPS metric, and release the final test dataset to support future research.
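The summary above does not define CPV, so the following is only a rough sketch of how a relaxed, point-based grounding check might look. It assumes the metric counts a prediction as correct when the center of the predicted box falls inside the ground-truth box; that reading is an assumption, not the paper's stated definition. The names `Box`, `center_hit`, and `hit_rate` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in screen pixels (hypothetical helper type)."""
    left: float
    top: float
    right: float
    bottom: float

    def center(self) -> Tuple[float, float]:
        # Midpoint of the box, e.g. where an agent would click.
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)

    def contains(self, x: float, y: float) -> bool:
        # Inclusive bounds: a point on the border counts as inside.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def center_hit(predicted: Box, target: Box) -> bool:
    """Assumed relaxed check: is the predicted box's center inside the target box?"""
    cx, cy = predicted.center()
    return target.contains(cx, cy)


def hit_rate(pairs: Iterable[Tuple[Box, Box]]) -> float:
    """Fraction of (predicted, target) pairs that pass the center check."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(center_hit(p, t) for p, t in pairs) / len(pairs)


if __name__ == "__main__":
    examples = [
        (Box(10, 10, 30, 30), Box(0, 0, 25, 25)),        # center (20, 20) is inside
        (Box(100, 100, 120, 120), Box(0, 0, 50, 50)),    # center (110, 110) is outside
    ]
    print(hit_rate(examples))
```

A check of this kind is more lenient than an overlap-based score because a click only needs to land on the target element, not match its exact extent.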

Takeaways, Limitations

Takeaways:
Proposes IVG, a novel approach to identifying target elements in GUIs.
Contributes to the development of AI agents that automate GUI interaction.
Suggests potential for advances in software testing, accessibility, and HCI.
Provides two IVG methods (IVGocr and IVGdirect) along with dedicated datasets.
Supports future research by proposing a new evaluation metric, CPV, and releasing open datasets.
Limitations:
The proposed methods and their generalization performance on the datasets need further validation.
Robustness across different GUI styles and levels of complexity needs to be evaluated.
The limitations of the CPV metric need to be examined, along with a comparison against other evaluation metrics.
Performance evaluation and application studies in real GUI environments are needed.