Daily Arxiv

This page curates AI-related papers published worldwide.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

QuantX: A Framework for Hardware-Aware Quantization of Generative AI Workloads

Created by
  • Haebom

Authors

Muhammad Ahmad, Khurram Mazher, Saqib Akram, Ahmad Tameem, Saad Bin Nasir

Outline

QuantX is a collection of custom quantization recipes for LLMs and VLMs. It enables quantization down to 3-bit resolution with minimal performance degradation. QuantX's quantization strategies offer a flexible trade-off between execution speed, memory requirements, and model accuracy, while taking hardware-specific constraints into account so that dequantization remains efficient at inference time. Experimental results show that 3-bit quantized LlaVa-v1.6 performs within 6% of the unquantized model on several end-user tasks, outperforming recently published state-of-the-art quantization techniques. The authors also integrate specific QuantX techniques into the popular Llama.cpp framework and demonstrate their runtime feasibility compared to Llama.cpp's mainstream quantization methods. Finally, the paper distills the insights from the LLM quantization process that led to the various recipes and options integrated into QuantX.
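The paper's actual recipes are not reproduced in this summary, so the snippet below is only a minimal sketch of the general mechanism: group-wise symmetric 3-bit weight quantization in NumPy, where each group of weights shares one scale and dequantization collapses to a single multiply per group. The group size of 64 and the symmetric scheme are illustrative assumptions, not QuantX's specific choices.

```python
import numpy as np

def quantize_3bit(w: np.ndarray, group_size: int = 64):
    """Group-wise symmetric 3-bit quantization (illustrative sketch).

    Each group of `group_size` weights shares one FP16 scale; integer
    codes lie in the signed 3-bit range [-4, 3]. Real kernels would pack
    the codes into bit fields instead of storing one int8 per weight.
    """
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 4.0  # map max magnitude to |q| = 4
    scale = np.where(scale == 0, 1.0, scale)            # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -4, 3).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_3bit(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Dequantization is a single multiply per group -- cheap enough for a
    hardware-aware kernel to fuse into the matrix multiply."""
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

# Round-trip example on random weights
w = np.random.randn(4096).astype(np.float32)
q, s = quantize_3bit(w)
w_hat = dequantize_3bit(q, s)
print("mean abs error:", np.abs(w - w_hat).mean())
```

The hardware-aware framing matters because this per-group multiply is what runs on every forward pass; keeping it trivially fusable into the matmul is what lets low-bit models trade memory for speed without stalling inference.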

Takeaways, Limitations

Takeaways:
Effective LLM and VLM quantization techniques that minimize performance degradation even at low bit widths down to 3 bits.
A flexible trade-off between execution speed, memory usage, and accuracy that takes hardware constraints into account.
Strong results against state-of-the-art techniques (within 6% of the unquantized model for 3-bit quantized LlaVa-v1.6).
Practical applicability demonstrated through integration with Llama.cpp (a brief usage sketch follows this section).
Insights into the LLM quantization process behind the recipes and options in QuantX.
Limitations:
Results are reported mainly for a single model (LlaVa-v1.6) and framework (Llama.cpp); further work is needed to establish generalizability.
Limited experimentation across diverse hardware platforms.
No performance evaluation below 3-bit quantization.
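For context on the Llama.cpp integration point, here is a hedged usage sketch: it loads a mainstream low-bit GGUF quantization (Q3_K_M, Llama.cpp's closest stock analogue to 3 bits) through the llama-cpp-python bindings, i.e., the kind of baseline the paper compares against. The model filename is illustrative, and QuantX's own kernels are not part of mainline Llama.cpp as far as this summary states.

```python
# Sketch: running a mainstream 3-bit-class GGUF quantization (Q3_K_M) via the
# llama-cpp-python bindings to Llama.cpp. The model file below is hypothetical;
# QuantX's kernels are an integration described in the paper, not shown here.
from llama_cpp import Llama

llm = Llama(model_path="llava-v1.6-mistral-7b.Q3_K_M.gguf")  # illustrative path
out = llm("Summarize the benefits of 3-bit quantization.", max_tokens=64)
print(out["choices"][0]["text"])
```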