
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Omni-Think: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards

Created by
  • Haebom

Author

Derek Li, Jiaming Zhou, Amirreza Kazemi, Qianyi Sun, Abbas Ghaddar, Mohammad Ali Alomrani, Liheng Ma, Yu Luo, Dong Li, Feng Wen, Jianye Hao, Mark Coates, Yingxue Zhang

Outline

This paper targets progress toward artificial general intelligence (AGI) built on large language models (LLMs) that perform well across diverse tasks. To address the shortcomings of conventional supervised fine-tuning (SFT), which generalizes poorly and tends toward memorization rather than transfer, the authors propose Omni-Think, a unified reinforcement learning (RL) framework that combines rule-based verifiable rewards with generative preference signals obtained through LLM-as-a-Judge evaluation. This hybrid supervision enables consistent optimization across heterogeneous task types and extends RL-based training to subjective domains. In addition, a curriculum that progresses from structured to open-ended tasks improves performance and reduces forgetting. Experiments across four domains show that curriculum training improves average performance by 5.2% over joint training and by 9.1% over model merging, underscoring the importance of task-aware sampling and hybrid supervision for scaling RL-based post-training.
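To make the hybrid-reward and curriculum ideas concrete, below is a minimal Python sketch under stated assumptions: the function names, the judge prompt, the `judge_llm.generate` interface, and the four task labels are hypothetical illustrations, not the paper's actual implementation or API.

```python
import re
from typing import Optional

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Rule-based reward for structured tasks: exact match against the
    reference answer (hypothetical extraction pattern)."""
    match = re.search(r"Answer:\s*(.+)", response)
    predicted = match.group(1).strip() if match else response.strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0

def judge_reward(prompt: str, response: str, judge_llm) -> float:
    """Generative preference signal for open-ended tasks: ask a judge LLM
    for a 1-5 score and normalize it to [0, 1]. `judge_llm.generate` is an
    assumed interface, not a real library call."""
    rubric = (
        "Rate the following response for helpfulness and correctness "
        "on a 1-5 scale. Reply with a single integer.\n\n"
        f"Prompt: {prompt}\nResponse: {response}\nScore:"
    )
    raw = judge_llm.generate(rubric)
    found = re.search(r"[1-5]", raw)
    return (int(found.group()) - 1) / 4.0 if found else 0.0

def hybrid_reward(task_type: str, prompt: str, response: str,
                  gold_answer: Optional[str], judge_llm) -> float:
    """Route each sample to the reward source that fits its task type, so a
    single RL objective covers both verifiable and subjective domains."""
    if task_type in {"math", "code"} and gold_answer is not None:
        return verifiable_reward(response, gold_answer)   # rule-based
    return judge_reward(prompt, response, judge_llm)      # LLM-as-a-Judge

# Curriculum ordering: structured tasks first, open-ended tasks later,
# the progression the summary credits with reduced forgetting.
CURRICULUM = ["math", "code", "instruction_following", "creative_writing"]
```

The key point is the routing: every sample flows through one RL objective, but the reward source differs by task type, which is what lets a single policy update cover both verifiable and subjective domains.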

Takeaways, Limitations

Takeaways:
The Omni-Think framework offers an effective RL-based post-training method for improving LLM performance across diverse tasks.
A curriculum-based training strategy is shown to improve performance and reduce forgetting.
The results highlight the importance of task-aware sampling and hybrid supervision.
The approach suggests that RL-based training can be extended to subjective domains.
Limitations:
The experiments cover only four domains, so further research is needed to establish generalizability.
LLM-as-a-Judge evaluations are inherently subjective, and the biases they may introduce need to be addressed.
Further research is needed to optimize curriculum design.
The computational cost for large-scale experiments is expected to be significant.