Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please cite the source when sharing.

COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens

Created by
  • Haebom

Authors

Eugene Kwek, Wenpeng Yin

COMPACT: Joint Pruning for Efficient Language Models

Outline

This paper proposes COMPACT, a novel pruning technique for improving the efficiency of large language models (LLMs). COMPACT (i) shrinks the embedding and LM-head layers by removing rare vocabulary tokens, and (ii) prunes the intermediate channels of the feed-forward network (FFN) using common-token-weighted activations. The approach aims to reduce memory usage, latency, and cost while keeping the standard transformer architecture intact. Experiments on Qwen, LLaMA, and Gemma models (0.5B-70B) show that COMPACT substantially reduces parameter count, GPU memory, and latency while maintaining state-of-the-art performance.
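To make the two steps more concrete, below is a minimal, self-contained PyTorch sketch of the idea as described in the outline. It is not the authors' implementation: the toy tensor shapes, the 50% keep ratios, and names such as `calibration_ids` are illustrative assumptions.

```python
# Illustrative sketch of COMPACT-style joint pruning (NOT the authors' code).
# Assumptions: a tied embedding/LM head, a single FFN layer, and a random
# calibration token stream standing in for a real corpus.
import torch

torch.manual_seed(0)

vocab_size, hidden, ffn_inter = 1000, 64, 256
embedding = torch.randn(vocab_size, hidden)   # tied embedding / LM head rows
w_in  = torch.randn(ffn_inter, hidden)        # FFN up-projection
w_out = torch.randn(hidden, ffn_inter)        # FFN down-projection

# Calibration token ids (in practice: tokenized text from a small corpus).
calibration_ids = torch.randint(0, vocab_size, (10_000,))

# --- Step (i): vocabulary pruning --------------------------------------
# Count token occurrences, keep the most common half, and drop the rare
# rows from the embedding / LM-head matrix.
counts = torch.bincount(calibration_ids, minlength=vocab_size).float()
keep_vocab = counts.argsort(descending=True)[: int(0.5 * vocab_size)]
pruned_embedding = embedding[keep_vocab]      # (500, hidden)

# --- Step (ii): FFN channel pruning ------------------------------------
# Score each intermediate channel by its activation magnitude, weighting
# every calibration token by how common it is, then drop low-score channels.
token_freq = counts[calibration_ids]          # weight per calibration token
hidden_states = embedding[calibration_ids]    # stand-in for the layer inputs
acts = torch.relu(hidden_states @ w_in.T)     # (n_tokens, ffn_inter)
channel_score = (acts.abs() * token_freq.unsqueeze(1)).sum(dim=0)
keep_ch = channel_score.argsort(descending=True)[: int(0.5 * ffn_inter)]
pruned_w_in, pruned_w_out = w_in[keep_ch], w_out[:, keep_ch]

print(pruned_embedding.shape, pruned_w_in.shape, pruned_w_out.shape)
```

Because both steps only drop rows or columns of existing weight matrices, the pruned model keeps the standard transformer layout, which is what makes deployment straightforward.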

Takeaways, Limitations

Takeaways:
  • Maintains the standard transformer structure, which eases deployment.
  • Adapts flexibly to model scale by balancing vocabulary and FFN pruning.
  • Achieves competitive pruning time, memory savings, and higher throughput.
  • Achieves state-of-the-art performance across a range of models (0.5B-70B).
Limitations:
  • The paper does not explicitly state its limitations. (As with all pruning techniques, however, it remains important to consider how aggressively a model can be pruned without performance degradation.)