Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

Created by
  • Haebom

Author

Jaewoo Ahn, Junseo Kim, Heeseung Yun, Jaehyeon Son, Dongmin Park, Jaewoong Cho, Gunhee Kim


Outline

This paper introduces FlashAdventure, a novel benchmark for evaluating LLM-based GUI agents. FlashAdventure consists of 34 Flash-based adventure games, with the goal of completing each game's full story arc. It is designed to expose the "observation-action gap": the challenge of remembering long-term gameplay information and acting on it. The authors propose CUA-as-a-Judge, an automatic gameplay evaluator, and COAST, an agent framework that leverages long-term clue memory to plan and solve sequential tasks. Experiments show that current GUI agents struggle to complete full stories, while COAST improves milestone completion rates by narrowing the observation-action gap.

Takeaways, Limitations

Takeaways:
FlashAdventure, a new adventure game benchmark, enables evaluation of GUI agents' story-completion abilities.
The COAST framework is proposed and shown to be effective in addressing the observation-action gap.
CUA-as-a-Judge, an automated gameplay evaluator, is developed.
Limitations:
Current GUI agents still show a large performance gap compared to human players.
Continued research is needed to narrow this gap.