Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

PDTrim: Targeted Pruning for Prefill-Decode Disaggregation in Inference

Created by
  • Haebom

Author

Hao Zhang, Mengsi Lyu, Zhuo Chen, Xingrun Xing, Yulong Ao, Yonghua Lin

Outline

To address the high computational and memory overhead of large language models (LLMs), this paper proposes a model pruning method specialized for prefill-decode (PD) disaggregated inference. To overcome the limitations of existing methods, which ignore the characteristics of PD disaggregation, the authors construct separate pruning and distillation sets for the prefill and decode stages and perform iterative block removal independently for each stage. They further introduce a token-aware cache pruning mechanism that retains all KV cache entries during prefill while, for selected layers, reusing only the entries of the first and last token sequences during decoding, thereby minimizing communication costs. Experiments show that the method achieves superior performance and faster inference in both PD-disaggregated and non-disaggregated settings, while reducing data transmission bandwidth consumption by 4.95x.
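To make the cache-pruning idea more concrete, below is a minimal sketch of what the selective KV transfer from the prefill node to the decode node could look like. The function name, parameters (prune_kv_for_transfer, keep_first, keep_last), and the per-layer dictionary representation of the cache are illustrative assumptions, not the paper's actual implementation; the set of pruned layers is assumed to be given.

```python
# Sketch of token-aware KV cache pruning for PD-disaggregated inference.
# For layers selected for pruning, only the KV entries of the first and last
# few prompt tokens are forwarded to the decode node; other layers are sent in full.
# All names and shapes here are assumptions for illustration.

from typing import Dict, List, Tuple
import torch


def prune_kv_for_transfer(
    kv_cache: Dict[int, Tuple[torch.Tensor, torch.Tensor]],  # layer -> (K, V), each [batch, heads, seq_len, head_dim]
    pruned_layers: List[int],  # layers whose cache is trimmed before transfer (assumed given)
    keep_first: int = 4,       # leading prompt tokens to keep (illustrative value)
    keep_last: int = 4,        # trailing prompt tokens to keep (illustrative value)
) -> Dict[int, Tuple[torch.Tensor, torch.Tensor]]:
    """Return the subset of the prefill KV cache to send to the decode node."""
    transfer: Dict[int, Tuple[torch.Tensor, torch.Tensor]] = {}
    for layer, (k, v) in kv_cache.items():
        if layer not in pruned_layers:
            transfer[layer] = (k, v)  # unselected layers are transmitted in full
            continue
        seq_len = k.shape[2]
        if seq_len <= keep_first + keep_last:
            transfer[layer] = (k, v)  # nothing to trim for short prompts
            continue
        # Keep only the first and last token positions along the sequence axis.
        idx = torch.cat([
            torch.arange(keep_first, device=k.device),
            torch.arange(seq_len - keep_last, seq_len, device=k.device),
        ])
        transfer[layer] = (k[:, :, idx, :], v[:, :, idx, :])
    return transfer
```

Under these assumptions, the decode node receives the full cache for unselected layers and only a small fixed number of token positions for pruned layers, which is where the reported reduction in transfer bandwidth would come from.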

Takeaways, Limitations

Takeaways:
An effective model pruning method tailored to the characteristics of PD-disaggregated inference is presented.
Performance improvements through independent block removal in the prefill and decode stages.
Reduced communication costs and improved inference speed through a token-aware cache pruning mechanism.
4.95x reduction in data transmission bandwidth consumption.
Limitations:
Further research is needed to determine the generality of the proposed method and its applicability to various LLM architectures.
Further research is needed to optimize parameter tuning for specific LLM and hardware environments.
The robustness of the reported performance needs further verification through more diverse and extensive experiments.