Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Merge-of-Thought Distillation

Created by
  • Haebom

Author

Zhanming Shen, Zeyu Qin, Zenan Huang, Hao Chen, Jiaqi Hu, Yihong Zhuang, Guoshan Lu, Gang Chen, Junbo Zhao

Outline

This paper proposes Merge-of-Thought Distillation (MoT), a novel method for efficiently distilling long chain-of-thought (CoT) reasoning capabilities into a student model by leveraging multiple teacher models. To overcome the limitations of conventional distillation, which relies on a single teacher, MoT iteratively fine-tunes the student on each teacher's CoT data and then merges the resulting teacher-specific variants in weight space, repeating this branch-and-merge cycle over multiple rounds. Applied to a Qwen3-14B student on competition mathematics benchmarks using only a small number of high-quality CoT samples, MoT outperforms strong models such as DeepSeek-R1, Qwen3-30B-A3B, Qwen3-32B, and OpenAI-o1. It also surpasses the best single-teacher distillation and naive multi-teacher union, mitigates overfitting, and remains robust to distribution-shifted and peer-level teachers. Furthermore, it reduces catastrophic forgetting, improves general reasoning beyond the mathematical domain, and even cultivates better teacher models. These results show that MoT is a simple, scalable way to distill long-CoT capabilities from diverse teachers into small student models.
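The core loop described above is: branch the student, fine-tune each branch on one teacher's CoT samples, merge the branches in weight space, and repeat. The Python sketch below illustrates that loop under stated assumptions; it is not the paper's implementation. fine_tune_on_teacher is a hypothetical placeholder for supervised fine-tuning on one teacher's CoT data, and merge_in_weight_space uses plain parameter averaging, whereas the paper's actual merge and stopping criteria may differ.

    # Minimal sketch of a Merge-of-Thought-style loop (illustrative only).
    import copy
    import torch
    import torch.nn as nn

    def merge_in_weight_space(models: list[nn.Module]) -> nn.Module:
        """Average the floating-point parameters of teacher-specific student variants.
        Assumption: plain averaging; the paper's merge rule may differ."""
        merged = copy.deepcopy(models[0])
        merged_state = merged.state_dict()
        for key, value in merged_state.items():
            if value.is_floating_point():
                merged_state[key] = torch.stack(
                    [m.state_dict()[key] for m in models]
                ).mean(dim=0)
        merged.load_state_dict(merged_state)
        return merged

    def fine_tune_on_teacher(student: nn.Module, teacher_cot_data) -> nn.Module:
        """Hypothetical placeholder: supervised fine-tuning of a student copy
        on one teacher's CoT samples."""
        branch = copy.deepcopy(student)
        # ... run SFT of `branch` on teacher_cot_data here ...
        return branch

    def merge_of_thought(student: nn.Module, teacher_datasets, rounds: int = 3) -> nn.Module:
        """Iterate: branch per teacher, fine-tune each branch, merge in weight space."""
        for _ in range(rounds):
            branches = [fine_tune_on_teacher(student, data) for data in teacher_datasets]
            student = merge_in_weight_space(branches)
        return student

    if __name__ == "__main__":
        # Tiny demo with a dummy student and two stand-in "teacher" datasets.
        student = nn.Linear(8, 8)
        distilled = merge_of_thought(student, teacher_datasets=[None, None], rounds=2)
        print(sum(p.numel() for p in distilled.parameters()), "parameters in merged student")

Because merging happens in weight space rather than by pooling data, each round keeps the per-teacher fine-tuning runs independent, which is what allows conflicting teacher styles to be reconciled in the merged student.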

Takeaways, Limitations

Takeaways:
Presents a novel method for efficiently distilling long CoT reasoning capabilities by leveraging multiple teacher models.
Achieves strong performance with only a small amount of high-quality CoT data.
Outperforms single-teacher distillation and naive multi-teacher union, with greater robustness.
Reduces catastrophic forgetting and improves general reasoning beyond the mathematical domain.
Suggests the possibility of cultivating better teacher models.
Limitations:
The experimental results are largely limited to competition mathematics benchmarks; further research is needed to establish generalizability to other domains.
Teacher selection and merging strategies are not yet systematically optimized and warrant further study.
Further analysis of MoT's computational cost and memory efficiency is needed.