Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling

Created by
  • Haebom

Author

Derek Li, Jiaming Zhou, Leo Maxime Brunswic, Abbas Ghaddar, Qianyi Sun, Liheng Ma, Yu Luo, Dong Li, Mark Coates, Jianye Hao, Yingxue Zhang

Omni-Thinker: BWT-Aware Scheduling and Hybrid Supervision for Scaling RL-Based Post-Training toward General-Purpose LLMs

Outline

This paper presents research toward large language models (LLMs) capable of both structured reasoning and open-ended generation. Omni-Thinker is a unified reinforcement learning (RL) framework that scales LLMs across diverse tasks by combining hybrid rewards with backward-transfer-guided task scheduling. The hybrid rewards integrate rule-based verifiable signals with preference-based evaluations from an LLM-as-a-Judge, enabling learning in both deterministic and subjective domains. The scheduler orders tasks according to backward transfer (BWT), reducing forgetting and improving multi-task performance. Experiments across four domains show a 6.2% improvement over joint training and a 12.4% improvement over model merging. The authors further show that simple assumptions about backward transfer yield accurate predictions of curriculum outcomes, and that entropy dynamics explain the variance arising from generative tasks.
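The two mechanisms described above can be made concrete with a small sketch: a hybrid reward that mixes a rule-based verifiable signal with an LLM-as-a-Judge preference score, and a task ordering driven by estimated backward transfer. The Python sketch below is one plausible reading of these ideas; the function names, the mixing weight alpha, and the greedy ordering heuristic are illustrative assumptions, not the paper's actual implementation.

    # Minimal sketch (assumed interfaces, not the paper's code).

    def hybrid_reward(sample, rule_verifier, judge, alpha=0.5):
        """Blend a rule-based verifiable signal with an LLM-as-a-Judge preference score."""
        r_rule = rule_verifier(sample)    # e.g. exact match / unit tests, in {0.0, 1.0}
        r_judge = judge(sample)           # e.g. preference score normalized to [0.0, 1.0]
        return alpha * r_rule + (1.0 - alpha) * r_judge


    def bwt_aware_order(tasks, bwt):
        """Order tasks using estimated backward transfer (BWT).

        bwt[(a, b)] is the change in task-a performance after training on task b
        (negative values mean training on b causes forgetting of a). Tasks whose
        training most degrades the others are placed earlier, so later training
        can recover the loss.
        """
        def average_harm(task):
            others = [t for t in tasks if t != task]
            if not others:
                return 0.0
            return sum(bwt.get((other, task), 0.0) for other in others) / len(others)

        # Most harmful (most negative average BWT) first.
        return sorted(tasks, key=average_harm)


    if __name__ == "__main__":
        # Toy example: training on "creative_writing" tends to erode "math" skill.
        tasks = ["math", "coding", "creative_writing"]
        bwt = {("math", "creative_writing"): -0.08,
               ("coding", "creative_writing"): -0.05,
               ("math", "coding"): 0.01}
        print(bwt_aware_order(tasks, bwt))  # -> ['creative_writing', 'math', 'coding']

The greedy ordering here is only a heuristic stand-in for the paper's BWT-aware scheduler; the key point it illustrates is that pairwise backward-transfer estimates are enough to decide a curriculum order.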

Takeaways, Limitations

Takeaways:
Improves RL-based post-training of LLMs through hybrid rewards and BWT-aware scheduling.
Contributes to improving LLM performance across a wide range of tasks.
Highlights the importance of scheduling tasks based on BWT.
Suggests that curriculum outcomes can be predicted and that entropy dynamics can explain the variance of generative tasks.
Limitations:
The paper does not explicitly discuss its limitations.