This paper addresses aerial vision-language navigation (VLN), a novel task in which unmanned aerial vehicles (UAVs) navigate outdoor environments by following natural language commands and visual cues. To address the challenge of reasoning about spatial relationships in complex aerial scenes, we propose a zero-shot framework that requires no training and uses a large language model (LLM) as the action-prediction agent. Specifically, we develop a novel Semantic-Topo-Metric Representation (STMR) that enhances the spatial reasoning ability of the LLM: semantic masks of instruction-related landmarks are extracted and projected onto a top-down map that encodes the spatial and topological layout of surrounding landmarks, and the map is progressively expanded as navigation proceeds. At each step, a local map centered on the UAV is cropped from the expanded top-down map and converted into a matrix representation with distance metrics, which serves as the text prompt for the LLM to predict an action for the given command. Experiments in both real and simulated environments demonstrate the effectiveness and robustness of the proposed method, which improves the absolute success rate by 26.8% and 5.8% over state-of-the-art methods on simple and complex navigation tasks, respectively. The dataset and code will be released soon.
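To make the matrix-prompt idea concrete, the following is a minimal sketch (not the authors' released code) of how a UAV-centered local semantic map could be serialized into a text matrix with distance values for the LLM. The grid encoding, landmark names, and the function `local_matrix_prompt` are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch: turn a UAV-centered local semantic map into a
# matrix-style text prompt carrying landmark identities and distances.
# Grid layout, landmark ids, and function names are assumptions.
import numpy as np

def local_matrix_prompt(semantic_grid: np.ndarray,
                        cell_size_m: float,
                        landmark_names: dict) -> str:
    """Convert a square local map (0 = free space, k = landmark id), centered
    on the UAV, into a text matrix whose entries give each landmark and its
    straight-line distance (in meters) from the UAV."""
    h, w = semantic_grid.shape
    cy, cx = h // 2, w // 2                      # UAV sits at the grid center
    rows = []
    for y in range(h):
        cells = []
        for x in range(w):
            label = int(semantic_grid[y, x])
            if label == 0:
                cells.append(".")                # empty cell
            else:
                dist = cell_size_m * float(np.hypot(y - cy, x - cx))
                name = landmark_names.get(label, str(label))
                cells.append(f"{name}:{dist:.0f}m")
        rows.append(" ".join(cells))
    header = ("Local top-down map, UAV at center, "
              f"each cell = {cell_size_m:.0f} m; entries are landmark:distance.\n")
    return header + "\n".join(rows)

# Example: a 5x5 local map with a building to the north-west and a road to the east.
grid = np.zeros((5, 5), dtype=int)
grid[0, 0] = 1   # building
grid[2, 4] = 2   # road
prompt = local_matrix_prompt(grid, cell_size_m=10.0,
                             landmark_names={1: "building", 2: "road"})
print(prompt)    # this text, together with the instruction, would be sent to the LLM
```

In this sketch the matrix is regenerated at every step from the expanding top-down map, so the prompt always reflects the UAV's current egocentric view of nearby landmarks.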