Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling
Created by: Haebom
Authors: Derek Li, Jiaming Zhou, Leo Maxime Brunswic, Abbas Ghaddar, Qianyi Sun, Liheng Ma, Yu Luo, Dong Li, Mark Coates, Jianye Hao, Yingxue Zhang
Omni-Thinker: BWT-Aware Scheduling and Hybrid Supervision for Scaling RL-Based Post-Training toward General-Purpose LLMs
Outline
This paper presents work toward large language models (LLMs) capable of both structured reasoning and open-ended generation. Omni-Thinker is a unified reinforcement learning (RL) framework that scales LLMs across diverse tasks by combining hybrid rewards with backward-transfer-guided task scheduling. The hybrid rewards integrate rule-based verifiable signals with preference-based evaluations from an LLM-as-a-Judge, enabling learning in both deterministic and subjective domains. The scheduler orders tasks by backward transfer (BWT), which reduces forgetting and improves multi-task performance. Experiments across four domains show a 6.2% improvement over joint training and a 12.4% improvement over model merging. The authors further show that simple assumptions about backward transfer yield accurate predictions of curriculum outcomes, and that entropy dynamics explain the variance introduced by generative tasks.
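To make the hybrid-reward idea concrete, here is a minimal Python sketch; this is an illustration, not the paper's actual code. `verifier` and `judge_score` are hypothetical callables standing in for a rule-based answer checker and an LLM-as-a-Judge scorer.

```python
# Minimal sketch of a hybrid reward (hypothetical names, not the paper's code).
from typing import Callable, Optional

def hybrid_reward(
    prompt: str,
    response: str,
    reference: Optional[str],
    verifier: Callable[[str, str], bool],
    judge_score: Callable[[str, str], float],
) -> float:
    """Scalar reward for RL post-training.

    Deterministic tasks (e.g., math with a known answer) get a rule-based,
    verifiable reward; open-ended tasks fall back to a preference score
    from an LLM-as-a-Judge.
    """
    if reference is not None:
        # Rule-based signal: 1.0 iff the response matches the reference answer.
        return 1.0 if verifier(response, reference) else 0.0
    # Preference-based signal: assume the judge returns a score in [0, 1].
    return judge_score(prompt, response)
```

Routing verifiable tasks through the deterministic check keeps their reward noise-free, while the judge score lets RL operate on tasks with no single correct answer.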
Takeaways, Limitations
• Takeaways:
◦ Improves RL-based post-training of LLMs via hybrid rewards and BWT-based task scheduling.
◦ Improves LLM performance across a variety of tasks.
◦ Highlights the importance of BWT-aware task scheduling (see the sketch at the end of this page).
◦ Shows that simple BWT-based assumptions predict curriculum outcomes, and suggests that entropy dynamics explain the variance of generative tasks.
• Limitations:
◦ The paper does not explicitly discuss its limitations.
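For readers unfamiliar with backward transfer, the sketch below shows the standard BWT definition from continual learning and a brute-force way to rank candidate task orders by it. This is an illustrative assumption, not the authors' implementation; `estimate_acc` is a hypothetical function that returns the accuracy matrix for a candidate order (with only four domains, as in the paper, exhaustive enumeration is feasible).

```python
# Minimal sketch of BWT-aware task ordering (illustrative, not the authors' code).
import itertools
from typing import Callable, Sequence

def backward_transfer(acc: Sequence[Sequence[float]]) -> float:
    """Backward transfer (BWT) for one task sequence.

    acc[i][j] = accuracy on task j after training stage i completes.
    BWT = mean over earlier tasks of (final accuracy - accuracy right
    after that task was trained); negative values indicate forgetting.
    """
    T = len(acc)
    return sum(acc[T - 1][j] - acc[j][j] for j in range(T - 1)) / (T - 1)

def best_order(tasks: Sequence[str], estimate_acc: Callable) -> tuple:
    """Return the task order whose estimated accuracy matrix maximizes BWT.

    estimate_acc(order) -> T x T accuracy matrix, e.g. estimated from
    pairwise transfer measurements rather than full training runs.
    """
    return max(
        itertools.permutations(tasks),
        key=lambda order: backward_transfer(estimate_acc(order)),
    )
```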