Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models

Created by
  • Haebom

Author

Nan Zhang, Eugene Kwek, Yusen Zhang, Ngoc-Hieu Nguyen, Prasenjit Mitra, Rui Zhang

Outline

This paper studies how compression techniques such as quantization, distillation, and pruning, which are used to improve computational efficiency, affect large reasoning models (LRMs). Addressing the limitations of prior work, it compares all three compression techniques side by side and performs an in-depth interpretability analysis. The authors benchmark compressed DeepSeek-R1 models on four reasoning datasets and investigate how compression influences reasoning performance through fine-grained, activation-based causal analysis.
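As a rough illustration of what activation-based analysis can look like (this is a generic sketch, not the paper's actual procedure; the hook setup and the mean-absolute-activation score are assumptions), one can attach forward hooks to a PyTorch model and record per-module activation statistics, then compare them between a full-precision and a compressed model to flag components whose behavior shifts most:

```python
# Illustrative sketch only: collect per-module activation statistics with
# forward hooks. Modules whose statistics change most after compression are
# candidates for the "important" components such an analysis tries to locate.
import torch
from collections import defaultdict

def collect_activation_norms(model, inputs):
    """Record the mean absolute activation of every linear module."""
    stats = defaultdict(list)
    hooks = []

    def make_hook(name):
        def hook(_module, _inp, out):
            stats[name].append(out.detach().abs().mean().item())
        return hook

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        model(**inputs)

    for h in hooks:
        h.remove()
    return {name: sum(v) / len(v) for name, v in stats.items()}
```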

Takeaways, Limitations

Takeaways:
The number of weights has a greater impact on an LRM's knowledge retention than on its reasoning ability, which highlights the risks of pruning and distillation.
The MLP up-projection in the last layer of the distilled LRM is identified as one of the most important components, offering a new perspective on locating important weights.
Current quantization methods over-compress last-layer modules and MLP gate projections; protecting just 2% of these over-compressed weights significantly improves average accuracy (see the sketch after this section).
Limitations:
Specific limitations are not directly stated in the paper (though potential limitations may exist in the scope or methodology of the study).
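To make the "protect a small fraction of weights from over-compression" idea concrete, here is a minimal sketch of mixed-precision quantization that leaves roughly 2% of weights unquantized. It is an assumption-laden illustration, not the paper's method: the largest-magnitude criterion is only a stand-in for whatever sensitivity measure a real method would use.

```python
# Hedged sketch: uniformly quantize a weight matrix to low precision, but keep
# a small fraction of weights (here, the largest by magnitude) at full
# precision as a "protected" set.
import torch

def quantize_with_protection(weight: torch.Tensor, bits: int = 4,
                             protect_frac: float = 0.02) -> torch.Tensor:
    flat = weight.flatten()
    k = max(1, int(protect_frac * flat.numel()))
    protected_idx = flat.abs().topk(k).indices  # illustrative sensitivity proxy

    # Simple symmetric uniform quantization of the whole tensor.
    qmax = 2 ** (bits - 1) - 1
    scale = flat.abs().max() / qmax
    quantized = torch.round(flat / scale).clamp(-qmax, qmax) * scale

    # Restore the protected weights at full precision.
    quantized[protected_idx] = flat[protected_idx]
    return quantized.view_as(weight)
```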