Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

Created by
  • Haebom

Authors

Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang

Outline

In this paper, we investigate the role of reinforcement learning (RL) in improving the chain-of-thought reasoning capability of large language models (LLMs). We first show that 'aha moment' patterns (self-reflective behaviors such as self-correction) already exist in multimodal LLMs (MLLMs) before RL training, but that their presence does not necessarily correlate with improved reasoning performance. Based on this observation, we present a two-stage approach: supervised fine-tuning (SFT) on structured chain-of-thought reasoning patterns as a cold start, followed by reinforcement learning with GRPO (Group Relative Policy Optimization). Experiments show that this approach outperforms both SFT-only and RL-only methods across a range of multimodal reasoning benchmarks. It achieves state-of-the-art performance among open-source MLLMs at both the 3B and 7B scales; in particular, the 7B model improves substantially over its base model (e.g., MathVista 66.3% → 73.4%, We-Math 62.9% → 70.4%). This study provides practical guidance for building advanced multimodal reasoning models, and the code is publicly available on GitHub.
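To make the RL stage concrete, below is a minimal sketch of the group-relative advantage computation at the core of GRPO: for each prompt, several responses are sampled and scored, and each response's reward is normalized against the mean and standard deviation of its own group. The function name, reward values, and group size are illustrative, not taken from the paper.

```python
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: each sampled response is scored against
    the mean/std of rewards within its own group, so no learned value
    function (critic) is needed."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]

# Example: 4 responses sampled for one prompt, scored by a rule-based
# verifier (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards))  # correct responses get positive advantage
```

Because advantages are normalized within each sampled group, GRPO avoids training a separate critic model, which keeps the RL stage comparatively lightweight.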

Takeaways, Limitations

Takeaways:
Presents an effective two-stage approach (SFT + RL) for improving chain-of-thought reasoning in multimodal LLMs (see the sketch above).
Achieves state-of-the-art performance among open-source MLLMs by combining SFT and RL.
Reveals that 'aha moment' patterns do not always translate directly into improved reasoning performance.
Demonstrates scalability across model sizes by showing performance improvements on both 3B and 7B models.
Limitations:
Further research is needed on the generalization of the proposed approach.
Experiments on a wider variety of multimodal datasets are needed.
A deeper analysis of the relationship between 'aha moment' patterns and reasoning performance is needed.