In this paper, we investigate the role of reinforcement learning (RL) in improving the chain-of-thought reasoning capability of large language models (LLMs). First, we show that "aha moment" patterns (self-reflection and self-correction) exist in multimodal LLMs (MLLMs) even before RL training, but that their presence does not necessarily correlate with improved reasoning performance. Motivated by this observation, we present a two-stage approach that combines supervised fine-tuning (SFT) on structured chain-of-thought reasoning patterns with reinforcement learning via GRPO. Experimental results show that this approach outperforms both SFT-only and RL-only methods on a range of multimodal reasoning benchmarks. It achieves state-of-the-art performance among open-source MLLMs at both the 3B and 7B scales; in particular, the 7B model substantially improves over its baseline (e.g., MathVista 66.3% → 73.4%, We-Math 62.9% → 70.4%). This study provides practical guidance for building advanced multimodal reasoning models, and the code is publicly available on GitHub.
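To make the RL stage concrete, the sketch below illustrates the group-relative advantage normalization and clipped surrogate loss that characterize GRPO (Group Relative Policy Optimization). It is a minimal illustration under simplifying assumptions (scalar per-response rewards, sequence-level log-probabilities, and the KL penalty against a reference model omitted); the function names and tensor shapes are hypothetical and are not taken from the paper's released code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages.

    rewards: [num_prompts, group_size] scalar rewards for the sampled
    responses to each prompt. Each reward is normalized by the mean and
    standard deviation of its own group, so no learned critic is needed.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_policy_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate using group-relative advantages.

    logp_new / logp_old: log-probabilities of the sampled responses under
    the current and behavior policies; advantages: output of grpo_advantages.
    A KL term against a frozen reference model is typically added on top.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

In this formulation, sampling a group of responses per prompt and normalizing rewards within the group replaces the value network used in standard PPO, which is one reason GRPO is attractive for fine-tuning large (M)LLMs.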