GeoChain is a large-scale benchmark for evaluating step-by-step geographic reasoning in multimodal large language models (MLLMs). Leveraging 1.46 million Mapillary street-level images, GeoChain pairs each image with a 21-step chain of questions (over 30 million Q&A pairs). These sequences guide models from coarse attributes to fine-grained localization across four reasoning categories (visual, spatial, cultural, and precise geolocation) and are annotated by difficulty. Images are further enriched with semantic segmentation (150 classes) and a visual locatability score. Benchmarking state-of-the-art MLLMs (GPT-4.1 variants, Claude 3.7, and Gemini 2.5 variants) on a diverse 2,088-image subset reveals consistent weaknesses: models struggle with visual grounding, display erratic reasoning, and fail to localize precisely, especially as reasoning complexity increases. GeoChain offers a robust diagnostic methodology, critical for driving significant advances in complex geographic reasoning within MLLMs.