While large language models (LLMs) have demonstrated strong performance on natural-language-to-SQL (NL2SQL) tasks over general-purpose databases, extending them to GeoSQL introduces additional complexity from spatial data types, function calls, and coordinate systems, which significantly increases the difficulty of both generation and execution. To address this, we present GeoSQL-Eval, the first end-to-end automated evaluation framework for PostGIS query generation, together with GeoSQL-Bench, a benchmark for assessing LLM performance on NL2GeoSQL tasks. GeoSQL-Bench defines three task categories (conceptual understanding, syntax-level SQL generation, and schema discovery) and comprises 14,178 instances, 340 PostGIS functions, and 82 thematic databases. GeoSQL-Eval is grounded in Webb's Depth of Knowledge (DOK) model, spanning four cognitive dimensions, five skill levels, and 20 task types, and builds a complete evaluation pipeline from knowledge acquisition and syntax generation to semantic alignment, execution accuracy, and robustness. We evaluate 24 representative models across six categories and apply entropy-based weighting to statistically analyze performance differences, common error patterns, and resource usage. Finally, we release a public GeoSQL-Eval leaderboard platform for ongoing testing and global comparison.
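For concreteness, the entropy-based weighting can be read as the standard entropy weight method; the exact formulation used in GeoSQL-Eval may differ, so the following is only an illustrative sketch. Let $x_{ij}$ denote model $i$'s normalized score on metric $j$, for $m$ models and $k$ metrics (symbols introduced here for illustration):
\[
  p_{ij} = \frac{x_{ij}}{\sum_{i=1}^{m} x_{ij}}, \qquad
  e_{j} = -\frac{1}{\ln m} \sum_{i=1}^{m} p_{ij} \ln p_{ij}, \qquad
  w_{j} = \frac{1 - e_{j}}{\sum_{l=1}^{k} \left(1 - e_{l}\right)},
\]
so that a model's composite score is $S_{i} = \sum_{j=1}^{k} w_{j}\, x_{ij}$. Under this scheme, metrics on which models differ more (lower entropy $e_j$) receive larger weights in the aggregate ranking.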