This paper introduces FlashAdventure, a novel benchmark for evaluating LLM-based GUI agents, consisting of 34 Flash-based adventure games in which the agent's goal is to complete each game's full storyline. The benchmark is designed to surface the "observation-action gap": the challenge of remembering information observed earlier in gameplay and acting on it later. We propose CUA-as-a-Judge, an automatic gameplay evaluator, and COAST, an agent framework that leverages long-term cue memory to plan and solve sequential tasks. Experimental results show that current GUI agents struggle to complete full storylines, whereas COAST improves milestone completion rates by bridging the observation-action gap.