Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

Created by
  • Haebom

Author

Pranjal Aggarwal, Sean Welleck

Outline

Length Controlled Policy Optimization (LCPO) is a simple reinforcement learning method that optimizes for accuracy while respecting a user-specified length constraint. Using LCPO, we train L1, a reasoning language model that produces outputs satisfying the length constraint given in its prompt. Controlling L1's reasoning length enables a smooth trade-off between computational cost and accuracy across a variety of tasks, and L1 outperforms the prior S1 method. Furthermore, we find an unexpected short chain-of-thought capability in models trained with LCPO: Short Reasoning Models (SRMs) trained with LCPO exhibit reasoning patterns similar to full-length reasoning models while producing CoTs as short as those of non-reasoning models. At the same reasoning length, the 1.5B L1 model significantly outperforms GPT-4o.
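
Length control in LCPO works by stating a target token budget in the prompt and shaping the RL reward to respect it. The sketch below shows one way such a reward could look, assuming (as a simplification) a correctness term minus a linear penalty on the gap between generated and target token counts; the function name, signature, and `alpha` value are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of an LCPO-style reward (illustrative; not the authors' code).
# A rollout is scored by answer correctness minus a linear penalty on how far
# the generated chain-of-thought deviates from the target length in the prompt.

def lcpo_reward(is_correct: bool, generated_tokens: int, target_tokens: int,
                alpha: float = 3e-4) -> float:
    """Correctness reward with a penalty proportional to |target - generated| tokens.

    `alpha` trades accuracy off against length adherence; its value here is an
    assumed placeholder, not taken from the paper.
    """
    correctness = 1.0 if is_correct else 0.0
    length_penalty = alpha * abs(target_tokens - generated_tokens)
    return correctness - length_penalty

# Example: a correct answer that overshoots a 1024-token budget by 200 tokens
# still earns most of the reward, but less than one that hits the budget.
print(lcpo_reward(True, generated_tokens=1224, target_tokens=1024))  # 0.94
print(lcpo_reward(True, generated_tokens=1024, target_tokens=1024))  # 1.0
```

Under this kind of reward, the policy is pushed toward answers that are both correct and close to the requested budget, which is what yields the smooth cost-accuracy trade-off described above.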

Takeaways, Limitations

Takeaways:
LCPO makes the reasoning length of a reasoning model controllable.
It enables a flexible trade-off between computational cost and accuracy.
SRMs trained with LCPO achieve strong performance even with short chains of thought.
L1 outperforms the prior S1 method.
The 1.5B L1 model outperforms GPT-4o at the same reasoning length.
Limitations:
Specific limitations are not discussed in the paper.