Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning

Created by
  • Haebom

Author

Sumeet Ramesh Motwani, Alesia Ivanova, Ziyang Cai, Philip Torr, Riashat Islam, Shital Shah, Christian Schroeder de Witt, Charles London

Outline

This paper proposes a method for improving long-horizon reasoning in large language models, which degrade on such tasks, by leveraging existing short-horizon reasoning data. Specifically, the authors compose simple problems into complex multi-step dependency chains of arbitrary length, train the model with outcome-only rewards, and apply a curriculum that automatically increases chain complexity, making reinforcement learning (RL) training scalable. With this method, a model trained on synthetic chains built from grade-school math problems (GSM8K) achieves up to a 2.06x accuracy improvement on longer, more competitive benchmarks (GSM-Symbolic, MATH-500, and AIME), and generalizes to various ReasoningGym domains and long-context benchmarks.
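The core recipe above (chain short problems so each step depends on the previous answer, then grow chain length as accuracy improves) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation; the threshold, window size, and problem format are hypothetical.

```python
import random

def make_chain(problems, length, rng):
    """Compose `length` one-step problems into a multi-step chain:
    the answer of step i becomes the input of step i+1.
    `problems` is a list of (fn, description) pairs."""
    steps = [rng.choice(problems) for _ in range(length)]

    def solve(x0):
        # Ground-truth solver: apply each step in order.
        x = x0
        for fn, _ in steps:
            x = fn(x)
        return x

    text = " then ".join(desc for _, desc in steps)
    return text, solve

class Curriculum:
    """Automatic curriculum: once rolling accuracy on outcome-only
    rewards reaches a threshold over a window of rollouts, increase
    the chain length (hypothetical schedule)."""
    def __init__(self, start_length=2, threshold=0.8, window=100):
        self.length = start_length
        self.threshold = threshold
        self.window = window
        self.results = []

    def record(self, correct):
        # `correct` is the binary outcome reward for one rollout.
        self.results.append(bool(correct))
        recent = self.results[-self.window:]
        if len(recent) == self.window and sum(recent) / self.window >= self.threshold:
            self.length += 1      # graduate to longer chains
            self.results = []     # reset the accuracy window
```

For example, chaining the single step "add 1" three times yields the prompt "add 1 then add 1 then add 1", whose ground-truth answer from input 0 is 3; the curriculum would bump `length` from 2 to 3 once the model answers the threshold fraction of such chains correctly.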

Takeaways, Limitations

Takeaways:
  • Presents an efficient way to improve long-horizon reasoning using only existing short-horizon data.
  • Improves the scalability of RL training, contributing to progress on long-horizon reasoning problems.
  • Shows generalization beyond math problems to a variety of domains.
  • Shows that even high-performing models can learn new reasoning paths.
  • Achieves an exponential improvement in sample complexity compared to training directly on full-horizon problems.
Limitations:
  • The focus on one problem type (mathematics) means generalization to other problem types requires further study.
  • Detailed information about the model architecture and training hyperparameters is not provided, which may hinder reproduction.
  • Insufficient detail is given for comparison and analysis against other methods.