Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Created by
  • Haebom

Author

Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques

Outline

In this paper, we propose SPIRAL, a self-play framework that improves the reasoning ability of language models without human supervision. In SPIRAL, a model learns by playing multi-turn zero-sum games against itself, eliminating the need for humans to provide problem-answer pairs or design reward functions. To scale self-play training, we build a fully online multi-turn multi-agent reinforcement learning system and introduce role-conditioned advantage estimation (RAE), which computes each role's advantage against its own baseline to stabilize training. Experiments show that a Qwen3-4B-Base model trained only on Kuhn Poker improves mathematical reasoning by 8.6% and general reasoning by 8.4%, outperforming SFT on 25,000 expert game trajectories. We analyze that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Training on multiple games (TicTacToe, Kuhn Poker, Simple Negotiation) further improves performance by combining each game's strengths, and applying SPIRAL to an already strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) still yields an average improvement of 2.0%. In conclusion, we show that self-play on zero-sum games is a promising path to developing transferable reasoning capabilities.
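As a concrete illustration, below is a minimal Python sketch of what role-conditioned advantage estimation could look like inside a zero-sum self-play loop. This is a hedged reconstruction from the summary above, not the paper's implementation: the class name, the exponential-moving-average baseline, and the (game, role) keying are all illustrative assumptions.

```python
# Illustrative sketch of role-conditioned advantage estimation (RAE) for
# zero-sum self-play. Names and the EMA baseline are assumptions, not the
# paper's actual implementation.
from collections import defaultdict


class RoleConditionedAdvantage:
    """Keeps a separate moving-average return baseline per (game, role),
    so each role's advantage is centered independently. This matters in
    zero-sum self-play, where the same policy plays both sides and the
    two seats can see systematically different returns."""

    def __init__(self, decay: float = 0.95):
        self.decay = decay
        self.baseline = defaultdict(float)  # (game, role) -> EMA of returns

    def advantage(self, game: str, role: int, ret: float) -> float:
        key = (game, role)
        adv = ret - self.baseline[key]  # center the return on the role's baseline
        # Update the role-specific baseline toward the observed return.
        self.baseline[key] = self.decay * self.baseline[key] + (1.0 - self.decay) * ret
        return adv


# Toy usage: one self-play episode of a two-player zero-sum game returns
# +1 for the winner and -1 for the loser; both advantages would feed the
# same policy-gradient update, since one model plays both roles.
rae = RoleConditionedAdvantage()
adv_p0 = rae.advantage("kuhn_poker", role=0, ret=+1.0)
adv_p1 = rae.advantage("kuhn_poker", role=1, ret=-1.0)
print(adv_p0, adv_p1)
```

The design point this sketch tries to capture is that sharing a single baseline across roles would bias the advantage of whichever seat wins more often; conditioning the baseline on the role keeps each side's gradient signal centered.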

Takeaways, Limitations

Takeaways:
  • Presents a new method for improving the reasoning ability of language models without human intervention.
  • Demonstrates the effectiveness of self-play learning on zero-sum games.
  • Shows that training on a variety of games yields more diverse reasoning abilities.
  • Suggests applicability to a range of tasks through the demonstrated transfer of learned reasoning.
  • Proposes an efficient online multi-turn multi-agent reinforcement learning system and the RAE technique.
Limitations:
  • Results are currently limited to a small set of games; generalization to a broader range of games and tasks needs verification.
  • Overfitting or degenerate strategies that may emerge during self-play require further analysis.
  • The scalability of SPIRAL and its applicability to other language models need further study.
  • Whether the properties of zero-sum games transfer to all types of reasoning problems remains an open question.