To address the limitations of multimodal large language models (MLLMs), which still struggle with complex visual tasks such as spatial understanding and fine-grained perception, this paper presents SIFThinker, a spatially aware "thinking with images" framework that mimics human visual perception. SIFThinker interleaves depth-enhanced bounding boxes with natural language, enabling the model to correct its attention and focus on prompt-relevant image regions. Using a reverse-expansion-forward inference strategy, we construct an interleaved image-text chain of thought for process-level supervision, from which we build the SIF-50K dataset. Furthermore, we propose GRPO-SIF, a reinforcement learning paradigm that integrates depth-enhanced visual evidence, training the model to dynamically correct and focus its attention on prompt-relevant regions. Experimental results demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception while maintaining strong general capabilities.
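For intuition only, the sketch below illustrates what an interleaved reasoning trace with depth-enhanced bounding boxes might look like. The class names (`DepthBox`, `ThoughtStep`), the normalized-coordinate convention, and the use of a per-region mean depth are assumptions made for exposition, not the paper's actual data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DepthBox:
    """Hypothetical depth-enhanced bounding box: a 2D region plus a coarse depth value."""
    x1: float
    y1: float
    x2: float
    y2: float
    mean_depth: float  # e.g., average depth of the region from a monocular depth map

@dataclass
class ThoughtStep:
    """One step of an interleaved trace: free-form text plus the regions currently in focus."""
    text: str
    focus: List[DepthBox]

# Toy trace for the prompt "Which mug is closer to the camera?"
trace: List[ThoughtStep] = [
    ThoughtStep("Two mugs are visible; compare their depths.",
                [DepthBox(0.10, 0.55, 0.30, 0.80, mean_depth=1.2),
                 DepthBox(0.60, 0.50, 0.85, 0.78, mean_depth=2.7)]),
    ThoughtStep("The left mug has the smaller mean depth, so it is closer.",
                [DepthBox(0.10, 0.55, 0.30, 0.80, mean_depth=1.2)]),
]

for step in trace:
    regions = ", ".join(
        f"box=({b.x1:.2f},{b.y1:.2f},{b.x2:.2f},{b.y2:.2f}) depth={b.mean_depth}"
        for b in step.focus)
    print(f"{step.text}  [{regions}]")
```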