Multimodal large language models (MLLMs) still struggle with complex visual tasks such as spatial understanding and fine-grained perception. To address these limitations, this paper presents SIFThinker, a spatially aware "thinking in images" framework that mimics human visual perception. SIFThinker interleaves depth-enhanced bounding boxes with natural language, enabling the model to correct its attention and focus on relevant image regions. Using a reverse-expansion-forward inference strategy, we generate interleaved image-text chains of thought for process-level supervision and construct the SIF-50K dataset. Furthermore, we propose GRPO-SIF, a reinforcement learning paradigm that integrates depth-informed visual grounding to train the model to dynamically correct and focus on prompt-relevant regions. Experimental results demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception while maintaining strong generalization.