Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression

Created by
  • Haebom

Author

Mohammad Mozaffari, Amir Yazdanbakhsh, Maryam Mehri Dehnavi

Outline

This paper presents SLiM, a one-shot compression framework for large language models (LLMs) that addresses their high memory consumption and slow inference. Whereas existing model compression techniques require computationally expensive retraining to maintain accuracy, SLiM reduces model size without retraining while preserving accuracy. It works by integrating hardware-friendly quantization, sparsity, and low-rank approximation. Its key components are a probabilistic quantization approach (SLIM-Quant), semi-structured sparsity applied with an existing one-shot pruning method, and a low-rank adapter, computed from a novel importance function, that compensates for quantization and sparsity errors. Experimental results show that SLiM achieves up to 5.66% higher accuracy than existing methods, reduces memory usage to as little as 0.23x of the original model, and delivers up to 4.3x and 3.8x speedups on NVIDIA RTX 3060 and A100 GPUs, respectively. The authors also show that an optional Parameter-Efficient Fine-Tuning (PEFT) recipe yields further accuracy improvements.
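The core idea above (compress the weights, then add a small low-rank adapter that absorbs the compression error) can be illustrated with a minimal NumPy sketch. This is not the authors' algorithm: the uniform symmetric quantizer, the magnitude-based 2:4 pruning, and the SVD of the residual are generic stand-ins for SLIM-Quant, the one-shot pruner, and the importance-based adapter, respectively.

```python
import numpy as np

def compress_one_shot(W, bits=4, rank=8):
    """Illustrative one-shot compression: quantize, prune 2:4, add low-rank adapter."""
    # Uniform symmetric quantization (stand-in for SLIM-Quant)
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    W_q = np.round(W / scale) * scale

    # Semi-structured 2:4 sparsity: zero the 2 smallest-magnitude weights in each group of 4
    flat = W_q.reshape(-1, 4)
    smallest = np.argsort(np.abs(flat), axis=1)[:, :2]
    np.put_along_axis(flat, smallest, 0.0, axis=1)
    W_s = flat.reshape(W.shape)

    # Low-rank adapter compensates the compression error: W ~ W_s + L @ R
    E = W - W_s
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    L = U[:, :rank] * S[:rank]
    R = Vt[:rank]
    return W_s, L, R

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
W_s, L, R = compress_one_shot(W)

err_plain = np.linalg.norm(W - W_s)            # error of quantized + sparse weights alone
err_adapt = np.linalg.norm(W - (W_s + L @ R))  # error after adding the low-rank adapter
```

Because a rank-k SVD truncation is the best rank-k approximation of the residual, the adapter strictly shrinks the reconstruction error, which is the mechanism SLiM exploits to recover accuracy without retraining.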

Takeaways, Limitations

Takeaways:
Presents SLiM, a novel one-shot compression framework that effectively improves LLM memory usage and inference speed without retraining.
Achieves higher accuracy than existing one-shot compression methods.
Gains efficiency by combining hardware-friendly quantization, sparsity, and low-rank approximation.
An optional PEFT recipe provides additional accuracy improvements.
Limitations:
Performance of the presented method may vary with the specific LLM and hardware environment.
The PEFT recipe requires additional training time.
Evaluation on a broader range of LLMs and hardware environments is needed.
The paper lacks an ablation analysis of how much each component (quantization, sparsity, low-rank approximation) contributes to SLiM's performance.