This paper explores an approach that uses natural language (NL) test cases for GUI application verification, focusing specifically on the potential for LLM agents to directly execute NL test cases. To address the unsoundness and execution-consistency issues of NL test cases, we propose an algorithm that executes NL test cases under a guardrail mechanism, with a specialized agent that dynamically verifies each test step. Furthermore, we present metrics for evaluating test execution performance and execution consistency, and we define weak unsoundness, which characterizes acceptable NL test case execution at an industrial quality level (Six Sigma). Experiments using eight publicly available LLMs ranging from 3B to 70B parameters demonstrate both the potential and the current limitations of LLM agents for GUI testing. The results show that Meta Llama 3.1 70B achieves acceptable performance with high execution consistency (above the 3 Sigma level). A prototype tool, test suite, and results are also provided.
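The guardrail mechanism described above can be illustrated with a minimal sketch: each NL step is executed by an agent and then independently checked by a verifying agent, with bounded retries before the run is aborted. All names and the retry policy here are hypothetical illustrations, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StepResult:
    """Outcome of one NL test-step execution (hypothetical structure)."""
    success: bool
    detail: str = ""

def run_nl_test_case(
    steps: List[str],
    execute_step: Callable[[str], StepResult],  # LLM agent executes one NL step
    verify_step: Callable[[str], bool],         # specialized agent checks the resulting state
    max_retries: int = 2,
) -> bool:
    """Guardrail loop: every step's outcome must pass independent
    verification; a failed check triggers a bounded retry, and the
    whole run is rejected once retries are exhausted."""
    for step in steps:
        for _attempt in range(max_retries + 1):
            result = execute_step(step)
            if result.success and verify_step(step):
                break  # step verified, proceed to the next one
        else:
            return False  # guardrail: abort an unsound/inconsistent run
    return True

if __name__ == "__main__":
    # Toy stand-ins for the LLM agents:
    ok = run_nl_test_case(
        ["open settings", "toggle dark mode"],
        execute_step=lambda s: StepResult(True),
        verify_step=lambda s: True,
    )
    print(ok)  # → True
```

In this sketch, bounding retries and aborting on repeated verification failure is what makes nondeterministic NL execution yield a definite pass/fail verdict.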