Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Merge-of-Thought Distillation

Created by
  • Haebom

Author

Zhanming Shen, Zeyu Qin, Zenan Huang, Hao Chen, Jiaqi Hu, Yihong Zhuang, Guoshan Lu, Gang Chen, Junbo Zhao

Outline

This paper proposes Merge-of-Thought Distillation (MoT), a novel framework for efficiently distilling the reasoning ability of long chain-of-thought (CoT) models by leveraging multiple teacher models. Unlike conventional distillation methods that rely on a single teacher, MoT integrates the reasoning capabilities of several teachers into one student. Because the best teacher varies across students and datasets, the authors propose a lightweight framework that alternates between fine-tuning the student under each teacher's supervision and merging the resulting student variants in weight space. Applying MoT to a Qwen3-14B student with only about 200 high-quality CoT samples on competition-mathematics benchmarks yields performance gains that surpass strong models such as DeepSeek-R1, Qwen3-30B-A3B, Qwen3-32B, and OpenAI-o1. MoT also outperforms single-teacher distillation and naive multi-teacher data union, mitigates overfitting, and remains robust under distribution shift and with equally skilled (peer-level) teachers. In addition, MoT reduces catastrophic forgetting, improves general reasoning beyond mathematics, and can even produce better teachers. These results suggest that MoT is a simple and scalable method for efficiently distilling long-CoT skills from diverse teachers into smaller student models.
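The loop described above (teacher-specific fine-tuning branches followed by weight-space merging, repeated for several rounds) can be summarized with the minimal sketch below. The `finetune_on` helper and the uniform parameter-averaging merge are illustrative assumptions, not the authors' exact training recipe or merge operator.

```python
# Minimal sketch of a Merge-of-Thought-style loop, assuming per-teacher SFT
# followed by uniform weight-space averaging (the paper may differ in details).
import copy
import torch


def finetune_on(model, cot_data, lr=1e-5, epochs=1):
    """Placeholder supervised fine-tuning loop on one teacher's CoT data."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for inputs, targets in cot_data:
            loss = torch.nn.functional.cross_entropy(model(inputs), targets)
            opt.zero_grad()
            loss.backward()
            opt.step()


def merge_of_thought(student, teacher_cot_sets, num_rounds=3):
    """Alternate teacher-specific fine-tuning with weight-space merging.

    student          -- torch.nn.Module being distilled into
    teacher_cot_sets -- list of CoT datasets, one per teacher model
    """
    for _ in range(num_rounds):
        # 1) Branch: fine-tune one copy of the student per teacher's CoT data.
        branches = []
        for cot_data in teacher_cot_sets:
            branch = copy.deepcopy(student)
            finetune_on(branch, cot_data)
            branches.append(branch)

        # 2) Merge: average the branch parameters back into a single student
        #    (uniform weight-space average; assumed merge operator).
        merged_state = {}
        for name, param in student.state_dict().items():
            merged_state[name] = torch.stack(
                [b.state_dict()[name].float() for b in branches]
            ).mean(dim=0).to(param.dtype)
        student.load_state_dict(merged_state)

    return student
```

The key design point reflected here is that teacher diversity enters through separate fine-tuning branches, while the merge step consolidates their (possibly conflicting) updates in parameter space rather than mixing their data directly.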

Takeaways, Limitations

Takeaways:
  • Presents a novel method for efficiently distilling the reasoning ability of long-CoT models by leveraging multiple teacher models.
  • Achieves strong performance gains even with limited high-quality data (about 200 CoT samples).
  • Outperforms single-teacher distillation and naive multi-teacher aggregation in both performance and robustness.
  • Reduces catastrophic forgetting and improves general reasoning beyond mathematics.
  • Suggests the possibility of producing better teacher models in turn.
  • Offers a simple and scalable framework.
Limitations:
  • Since the paper focuses primarily on competition-mathematics benchmarks, further research is needed on generalization to other domains.
  • Optimal teacher selection and weight-space merging strategies may require further study.
  • The definition and selection criteria for the high-quality CoT samples used are not clearly explained.