Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please cite the source when sharing.

Learning to Reason as Action Abstractions with Scalable Mid-Training RL

Created by
  • Haebom

Authors

Shenao Zhang, Donghan Yu, Yihao Feng, Bowen Jin, Zhaoran Wang, John Peebles, Zirui Wang

Outline

Large language models (LLMs) show impressive gains when combined with reinforcement learning (RL), but unlocking this potential requires an effective mid-training (intermediate training) stage. A good mid-training stage should identify a compact set of useful actions and enable fast selection among them via online RL. This study presents the first theoretical results on how mid-training shapes post-training: it characterizes an action subspace that minimizes both the value approximation error introduced by pruning and the RL error incurred during subsequent planning. The analysis reveals two key determinants of mid-training effectiveness: pruning efficiency, which shapes the prior of the initial RL policy, and its impact on RL convergence, which bounds how far that policy can be improved through online interaction. Building on these findings, the study proposes Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm. RA3 derives a sequential variational lower bound and optimizes it by iteratively discovering temporally coherent latent structures via RL and then fine-tuning on the bootstrapped data. Experiments on code generation tasks demonstrate the effectiveness of RA3.
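
The summary does not include code; the sketch below is a minimal, hypothetical illustration of the iterative loop described above (discover useful latent action abstractions via reward-guided search, then fine-tune on the bootstrapped data). The function names, the categorical "policy", and the toy reward are all assumptions made for illustration; the actual RA3 method operates on an LLM policy and optimizes a sequential variational lower bound.

```python
# Toy sketch of an RA3-style mid-training loop (illustrative only, not the
# authors' implementation). The "policy" here is a categorical prior over a
# handful of latent abstractions; in the real algorithm it would be an LLM.
import random

def reward_fn(latent, prompt):
    """Toy stand-in for task reward, e.g. unit tests passing in code generation."""
    return 1.0 if latent == prompt % 4 else 0.0

def propose_latents(policy_probs, k=8):
    """Sample candidate latent abstractions from the current policy prior."""
    latents = list(range(len(policy_probs)))
    return random.choices(latents, weights=policy_probs, k=k)

def mid_train(num_iters=20, num_latents=4, lr=0.5):
    # Start from a uniform prior over latent action abstractions.
    policy = [1.0 / num_latents] * num_latents
    for _ in range(num_iters):
        prompts = [random.randrange(100) for _ in range(16)]
        bootstrapped = []
        # Discovery step: find reward-yielding latents via search under the prior.
        for p in prompts:
            candidates = propose_latents(policy)
            best = max(candidates, key=lambda z: reward_fn(z, p))
            if reward_fn(best, p) > 0:
                bootstrapped.append(best)
        # Fine-tuning step: move the prior toward the bootstrapped latents.
        if bootstrapped:
            counts = [bootstrapped.count(z) for z in range(num_latents)]
            total = sum(counts)
            policy = [(1 - lr) * policy[z] + lr * counts[z] / total
                      for z in range(num_latents)]
    return policy

if __name__ == "__main__":
    print(mid_train())
```

In the paper's setting, the fine-tuning step would update the LLM's parameters on bootstrapped reasoning traces rather than re-weighting a small categorical prior, and the discovery step would be driven by online RL rather than exhaustive scoring.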

Takeaways, Limitations

Takeaways:
  • Mid-training plays an important role in combining LLMs with RL, and operating in an action abstraction space makes it particularly effective.
  • The RA3 algorithm outperforms existing methods on code generation tasks.
  • RA3 achieves fast convergence and high asymptotic performance.
Limitations:
  • The theoretical results focus on a specific aspect of mid-training; generalizing them to a broader range of LLM-RL systems requires further research.
  • RA3's performance may vary with the specific task and model, so verification across diverse settings is needed.
  • The paper lacks an analysis of optimal hyperparameter settings for RA3; further experiments are needed.