Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Omni-Thinker: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards

Created by
  • Haebom

Authors

Derek Li, Jiaming Zhou, Amirreza Kazemi, Qianyi Sun, Abbas Ghaddar, Mohammad Ali Alomrani, Liheng Ma, Yu Luo, Dong Li, Feng Wen, Jianye Hao, Mark Coates, Yingxue Zhang

Outline

This paper targets general-purpose AI built on large language models (LLMs) that perform well across diverse tasks. Conventional supervised fine-tuning (SFT) struggles to generalize, favoring memorization over transfer. To address this, the authors present Omni-Thinker, a unified reinforcement learning (RL) framework that combines rule-based verifiable rewards with generative preference signals obtained via LLM-as-a-Judge evaluation. Omni-Thinker enables consistent optimization across task types and extends RL-based training to subjective domains. A curriculum that progresses from structured to open-ended tasks improves performance and reduces forgetting. Experiments across four domains show that curriculum learning outperforms joint training by 5.2% and model merging by 9.1%, underscoring the importance of task-aware sampling and hybrid supervision when scaling RL-based post-training for general-purpose LLMs.
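The two core ideas above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names (`rule_verify`, `judge_score`, `curriculum_order`) and the task-type flag are hypothetical, and the LLM-as-a-Judge call is replaced by a trivial stub so the sketch runs end to end.

```python
# Sketch of hybrid rewards + curriculum ordering, under the assumptions
# stated above (hypothetical names; judge call stubbed out).

def rule_verify(answer: str, reference: str) -> float:
    """Rule-based verifiable reward, e.g. exact match for math answers."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def judge_score(answer: str, prompt: str) -> float:
    """Generative preference signal; stands in for an LLM-as-a-Judge call.
    A trivial length heuristic keeps the sketch runnable."""
    return min(len(answer.split()) / 50.0, 1.0)

def hybrid_reward(task: dict, answer: str) -> float:
    """Route each rollout to the reward source matching its task type."""
    if task["type"] == "verifiable":            # math, code, logic, ...
        return rule_verify(answer, task["reference"])
    return judge_score(answer, task["prompt"])  # open-ended / subjective

def curriculum_order(tasks: list[dict]) -> list[dict]:
    """Curriculum progression: structured (verifiable) tasks first,
    open-ended tasks later; sorted() is stable within each group."""
    return sorted(tasks, key=lambda t: 0 if t["type"] == "verifiable" else 1)
```

A trainer would then sample batches in `curriculum_order` and score every rollout with `hybrid_reward`, giving one optimization loop across both objective and subjective tasks.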

Takeaways, Limitations

Takeaways:
  • Omni-Thinker is an effective RL framework for improving LLM performance across a variety of tasks.
  • A curriculum-based learning strategy improves the performance and generalization ability of RL-trained LLMs.
  • Task-aware sampling and hybrid supervision are key to scaling multi-task RL.
  • The hybrid-reward approach offers a novel way to extend RL-based training to subjective domains.
Limitations:
  • Experiments are limited to four domains; additional experiments on more diverse tasks and domains are needed.
  • The reliability and objectivity of LLM-as-a-Judge evaluation require further analysis.
  • Further research is needed to optimize and generalize curriculum design.
  • A more detailed analysis of Omni-Thinker's computational cost and efficiency is needed.