Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

GeoAnalystBench: A GeoAI benchmark for assessing large language models for spatial analysis workflow and code generation

Created by
  • Haebom

Authors

Qianheng Zhang, Song Gao, Chen Wei, Yibo Zhao, Ying Nie, Ziru Chen, Shijie Chen, Yu Su, Huan Sun

Outline

To address the uncertainty surrounding the potential of automating geospatial analysis and GIS tasks with Large Language Models (LLMs), this paper presents GeoAnalystBench, a benchmark of 50 Python-based geoprocessing tasks validated by GIS experts. For each task, GeoAnalystBench defines a minimum expected deliverable and evaluates workflow validity, structural alignment, semantic similarity, and code quality (CodeBLEU). Experimental results show that proprietary models such as ChatGPT-4o-mini achieve high validity (95%) and stronger code alignment (CodeBLEU 0.39), whereas open-source models such as DeepSeek-R1-7B often produce incomplete or inconsistent results (validity 48.5%, CodeBLEU 0.272). All models struggled on tasks requiring deep spatial reasoning, such as spatial relationship detection and optimal site selection. These findings demonstrate both the potential and the limitations of LLMs for GIS automation and provide a reproducible, human-in-the-loop framework for advancing GeoAI research.
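To make the evaluation setup concrete, the sketch below shows one way a benchmark loop of this kind could look in Python. It is a minimal illustration, not the benchmark's actual code: the task record fields, the dummy model callable, and the token-overlap scorer are hypothetical stand-ins, and the real benchmark scores code quality with CodeBLEU alongside workflow validity, structural alignment, and semantic similarity.

```python
from collections import Counter
from typing import Callable

def token_overlap(reference: str, candidate: str) -> float:
    """Crude similarity proxy: fraction of reference tokens the candidate reproduces.
    (A stand-in for CodeBLEU, which also weights syntax and data-flow matches.)"""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    matched = sum(min(ref[tok], cand[tok]) for tok in ref)
    return matched / max(sum(ref.values()), 1)

def score_tasks(tasks: list[dict], generate: Callable[[str], str]) -> float:
    """Average proxy score over task records with 'prompt' and 'reference_code' fields."""
    scores = [token_overlap(t["reference_code"], generate(t["prompt"])) for t in tasks]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Tiny in-memory example with a dummy "model" that returns a fixed snippet;
    # a real run would call an LLM and load the benchmark's 50 geoprocessing tasks.
    demo_tasks = [{"prompt": "Buffer the input points by 500 m.",
                   "reference_code": "buffered = gdf.buffer(500)"}]
    dummy_model = lambda prompt: "buffered = gdf.buffer(500)"
    print(f"mean proxy score: {score_tasks(demo_tasks, dummy_model):.3f}")
```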

Takeaways, Limitations

Takeaways:
  • GeoAnalystBench provides a standard benchmark for objectively evaluating the geospatial analysis capabilities of LLMs.
  • The results clearly quantify the performance gap between proprietary and open-source models, suggesting directions for future model development.
  • The paper highlights the limitations of LLMs on tasks requiring deep spatial reasoning and suggests directions for further research.
  • It emphasizes the importance of GeoAI research built on human-LLM collaboration.
Limitations:
  • The benchmark is currently limited to Python-based tasks.
  • It may be difficult to completely eliminate subjectivity from the evaluation metrics.
  • Further research is needed on other types of geospatial data and tasks.
  • The type and number of tasks included in the benchmark may be limited.