Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance

Created by
  • Haebom

Authors

Lixuan He, Jie Feng, Yong Li

Outline

This paper proposes a novel approach to overcoming the limitations of the conventional two-stage pipeline for improving the reasoning performance of large language models (LLMs): supervised fine-tuning (SFT) followed by reinforcement learning (RL). Through the lens of implicit rewards, the approach views SFT and RL as complementary reward signals rather than separate stages. To address the drawbacks of existing methods, such as catastrophic forgetting and a suboptimal imitation-exploration trade-off, we propose Adaptive Meta-Fine-Tuning (AMFT), a single-stage algorithm that learns the optimal balance between SFT's implicit path-level reward and RL's explicit outcome-based reward. At the core of AMFT is a meta-gradient adaptive weight controller that treats the SFT-RL balance as a learnable parameter and dynamically optimizes it to maximize long-term task performance. Regularized by policy entropy for stability, the controller autonomously discovers an effective training curriculum. AMFT achieves state-of-the-art performance on benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL), and generalizes well to out-of-distribution (OOD) tasks. Ablation studies and analysis of the training dynamics show that the meta-learned controller is crucial to AMFT's stability, sample efficiency, and final performance.
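To make the core idea concrete, below is a minimal, self-contained sketch of the kind of mechanism described above: a learnable SFT-RL mixing weight updated by a meta-gradient through a one-step-lookahead policy update, with policy entropy as a stability regularizer. It assumes PyTorch; the toy linear "policy", the synthetic batches, the held-out proxy objective, and all names (alpha_logit, lr_inner, losses) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a meta-learned SFT-RL balance on a toy policy,
# not the AMFT reference implementation.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, hidden, lr_inner = 16, 8, 0.1

# Toy "policy": a single weight matrix standing in for an LLM (logits = x @ W).
W = torch.randn(hidden, vocab, requires_grad=True)
# Learnable balance logit; alpha = sigmoid(alpha_logit) mixes the SFT and RL losses.
alpha_logit = torch.zeros(1, requires_grad=True)
meta_opt = torch.optim.Adam([alpha_logit], lr=1e-2)

def losses(W, x, target):
    logits = x @ W
    dist = torch.distributions.Categorical(logits=logits)
    # SFT term: imitation of "expert" tokens (path-level supervision).
    sft = F.cross_entropy(logits, target)
    # RL term: REINFORCE surrogate with a toy outcome-based reward.
    action = dist.sample()
    reward = (action == target).float()
    rl = -(dist.log_prob(action) * reward).mean()
    # Policy entropy, used here as a stability regularizer.
    ent = dist.entropy().mean()
    return sft, rl, ent

for step in range(200):
    x = torch.randn(4, hidden)                 # fake batch of states
    target = torch.randint(0, vocab, (4,))     # fake expert tokens

    alpha = torch.sigmoid(alpha_logit)
    sft, rl, ent = losses(W, x, target)
    train_loss = alpha * sft + (1 - alpha) * rl - 0.01 * ent

    # Inner step: one-step-lookahead policy update kept in the autograd graph
    # (create_graph=True) so the meta-gradient can flow back to alpha.
    grad_W, = torch.autograd.grad(train_loss, W, create_graph=True)
    W_lookahead = W - lr_inner * grad_W

    # Meta step: nudge alpha so the *post-update* policy does better on a
    # held-out batch, a crude proxy for "long-term task performance".
    x_val = torch.randn(4, hidden)
    t_val = torch.randint(0, vocab, (4,))
    val_loss = F.cross_entropy(x_val @ W_lookahead, t_val)
    meta_grad, = torch.autograd.grad(val_loss, alpha_logit)
    alpha_logit.grad = meta_grad
    meta_opt.step()
    meta_opt.zero_grad()

    # Commit the inner update to the actual policy parameters.
    with torch.no_grad():
        W -= lr_inner * grad_W
```

In the actual method, the inner update is a fine-tuning step on an LLM and the meta-objective reflects long-term task performance rather than a single held-out cross-entropy batch; the sketch only illustrates how a meta-gradient can adjust the imitation-exploration balance during training.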

Takeaways, Limitations

Takeaways:
Presents AMFT, a new single-stage learning algorithm that overcomes the limitations of the existing two-stage SFT-then-RL pipeline.
Effectively integrates the reward signals of SFT and RL through the lens of implicit rewards.
Improves long-term task performance by dynamically optimizing the SFT-RL balance with a meta-gradient adaptive weight controller.
Achieves state-of-the-art results and strong generalization across diverse benchmarks.
Supports reproducibility and extensibility through the release of open-source code.
Limitations:
The complexity of the AMFT algorithm may increase computational cost.
Further validation is needed to rule out over-tuning to the evaluated benchmarks and to confirm generalization to other types of tasks.
The behavior of the meta-gradient adaptive weight controller calls for deeper analysis and interpretation.