Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

QSpec: Speculative Decoding with Complementary Quantization Schemes

Created by
  • Haebom

Author

Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu

Outline

This paper proposes QSpec, a novel quantization paradigm that improves on the quantization techniques widely used to accelerate Large Language Model (LLM) inference and reduce memory usage. QSpec decouples efficiency from quality by combining low-precision joint quantization for fast drafting with high-precision weight-only quantization for accurate verification. It minimizes transition costs by reusing weights and KV caches across the two stages, without retraining or auxiliary models. QSpec achieves up to a 1.64x speedup over high-precision baselines and up to a 1.55x improvement over conventional speculative decoding methods in batched settings. Furthermore, QSpec supports plug-and-play deployment and generalizes well across model sizes, quantization methods, and workloads.
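The draft-then-verify flow described above can be illustrated with a minimal sketch. The two model calls below are hypothetical stand-ins (toy random distributions), not QSpec's actual quantized kernels; the sketch only shows the standard speculative decoding accept/reject rule that such a scheme relies on, where a drafted token `t` is kept with probability `min(1, p_target(t) / p_draft(t))`.

```python
import random

random.seed(0)

VOCAB = 8  # toy vocabulary size


def draft_probs(ctx):
    # Hypothetical stand-in for the fast, low-precision draft model.
    rng = random.Random(hash(tuple(ctx)))
    w = [rng.random() for _ in range(VOCAB)]
    s = sum(w)
    return [x / s for x in w]


def target_probs(ctx):
    # Hypothetical stand-in for the accurate, high-precision verify model.
    rng = random.Random(hash(tuple(ctx)) + 1)
    w = [rng.random() for _ in range(VOCAB)]
    s = sum(w)
    return [x / s for x in w]


def speculative_step(ctx, k=4):
    """Draft k tokens cheaply, then verify them against the target model.

    A drafted token t is accepted with probability min(1, p_target/p_draft);
    on the first rejection we resample from the residual distribution
    max(0, p_target - p_draft) and stop. (The usual "bonus token" sampled
    when all k drafts are accepted is omitted for brevity.)
    """
    drafted = []
    c = list(ctx)
    for _ in range(k):
        p = draft_probs(c)
        t = random.choices(range(VOCAB), weights=p)[0]
        drafted.append((t, p[t]))
        c.append(t)

    accepted = []
    c = list(ctx)
    for t, q in drafted:
        p = target_probs(c)[t]
        if random.random() < min(1.0, p / q):
            accepted.append(t)  # draft agreed closely enough; keep it
            c.append(t)
        else:
            # Rejected: resample from the residual distribution and stop.
            resid = [max(0.0, a - b)
                     for a, b in zip(target_probs(c), draft_probs(c))]
            if sum(resid) > 0:
                accepted.append(
                    random.choices(range(VOCAB), weights=resid)[0])
            break
    return accepted
```

In QSpec the key point is that both "models" are the same weights at different precisions sharing one KV cache, so accepted runs of draft tokens cost only the cheap low-precision pass.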

Takeaways, Limitations

Takeaways:
  • Combines low-precision and high-precision quantization to improve both the speed and the accuracy of LLM inference.
  • Enables efficient inference through weight and KV cache reuse, with no retraining required.
  • Highly flexible: applicable across model sizes, quantization methods, and workloads.
  • Presents a practical path to high-quality quantized LLM serving in memory-constrained environments.
  • Outperforms existing speculative decoding methods.
Limitations:
  • The paper does not explicitly discuss its limitations.