To address the uncertainty surrounding the potential of automating geospatial analysis and GIS tasks with Large Language Models (LLMs), this paper presents GeoAnalystBench, a benchmark of 50 Python-based geoprocessing tasks validated by GIS experts. For each task, GeoAnalystBench evaluates whether a minimum deliverable output is produced, along with workflow validity, structural alignment, semantic similarity, and code quality (CodeBLEU). Experimental results show that proprietary models such as ChatGPT-4o-mini achieve high validity (95%) and code alignment (CodeBLEU 0.39), whereas open-source models such as DeepSeek-R1-7B produce incomplete or inconsistent results (48.5% validity, CodeBLEU 0.272). All models struggled on tasks requiring deep spatial reasoning, such as spatial relationship detection and optimal site selection. These findings demonstrate both the potential and the current limitations of LLMs for GIS automation, and the benchmark provides a reproducible framework for advancing GeoAI research, including studies of human intervention.
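To make the evaluation dimensions concrete, the sketch below shows how a benchmark of this kind might score a model-generated geoprocessing script against an expert reference. It is an illustration only, not code from GeoAnalystBench: the function names are assumptions, and the token-overlap measure is a deliberately simplified stand-in for the full CodeBLEU metric, which additionally weighs n-gram, AST, and data-flow matches.

```python
# Illustrative sketch only: a minimal scorer in the spirit of the
# benchmark's validity and code-quality checks. Function names and the
# token-overlap proxy are assumptions, not the authors' implementation.
import ast
import io
import subprocess
import sys
import tempfile
import tokenize
from collections import Counter


def is_valid(script: str, timeout: int = 60) -> bool:
    """Validity: the generated script parses and runs without error."""
    try:
        ast.parse(script)  # reject syntactically broken output early
    except SyntaxError:
        return False
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def code_similarity(candidate: str, reference: str) -> float:
    """Crude stand-in for CodeBLEU: unigram overlap of Python tokens
    between the model's script and the expert reference."""
    def toks(src: str) -> Counter:
        return Counter(
            t.string
            for t in tokenize.generate_tokens(io.StringIO(src).readline)
            if t.string.strip()  # drop newline/end markers
        )
    cand, ref = toks(candidate), toks(reference)
    return sum((cand & ref).values()) / max(sum(cand.values()), 1)


if __name__ == "__main__":
    gen = "import math\nprint(math.pi)\n"
    ref = "import math\nradius = 500\nprint(math.pi * radius ** 2)\n"
    print(is_valid(gen), round(code_similarity(gen, ref), 3))
```

In a full pipeline, validity would be aggregated over all 50 tasks (yielding rates such as the 95% and 48.5% reported above), while a proper CodeBLEU implementation would replace the unigram proxy shown here.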