To address the limitations of post-training quantization (PTQ) and the high memory overhead of quantization-aware training (QAT), the two main quantization approaches for reducing the deployment cost of large language models (LLMs), the authors propose ZeroQAT, a QAT framework based on zeroth-order optimization. ZeroQAT eliminates backpropagation, reducing computational and memory overhead while retaining the benefits of end-to-end optimization. The authors also introduce a lightweight variant of ZeroQAT for quantized fine-tuning, further reducing memory usage. Experimental results show that ZeroQAT outperforms leading PTQ and QAT baselines while requiring significantly less memory; for example, it can fine-tune a 13B model on a single 8GB GPU and a 6.7B model on a OnePlus 12 smartphone.
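To make the core idea concrete, here is a minimal, hypothetical sketch of an SPSA-style zeroth-order QAT step in PyTorch: full-precision latent weights are fake-quantized and randomly perturbed, the loss is evaluated in two forward passes, and their difference yields a gradient estimate, so no backward pass is needed. The function names (`fake_quantize`, `zo_qat_step`), the quantizer, and the hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def fake_quantize(w, num_bits=4):
    """Uniform symmetric fake quantization (quantize, then dequantize).
    A generic stand-in; ZeroQAT's actual quantizer may differ."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

@torch.no_grad()
def zo_qat_step(model, latent_weights, loss_fn, batch,
                lr=1e-6, eps=1e-3, num_bits=4):
    """One SPSA-style zeroth-order QAT step (illustrative sketch).

    latent_weights: full-precision tensors mirroring model.parameters().
    The model only ever sees quantized, perturbed copies of them, and the
    gradient is estimated from two forward passes, so no backpropagation
    (and no activation storage) is required.
    """
    params = list(model.parameters())
    zs = [torch.randn_like(w) for w in latent_weights]  # perturbation directions

    def eval_loss(sign):
        # Load quantized, perturbed latent weights into the model,
        # then run a forward pass to obtain the loss.
        for p, w, z in zip(params, latent_weights, zs):
            p.copy_(fake_quantize(w + sign * eps * z, num_bits))
        return loss_fn(model, batch).item()

    loss_plus, loss_minus = eval_loss(+1.0), eval_loss(-1.0)
    grad_scale = (loss_plus - loss_minus) / (2 * eps)

    # Plain SGD update on the full-precision latent weights.
    for w, z in zip(latent_weights, zs):
        w.add_(z, alpha=-lr * grad_scale)
    return 0.5 * (loss_plus + loss_minus)
```

Note that this sketch stores the perturbation tensors explicitly; memory-efficient zeroth-order trainers typically regenerate them from a saved random seed instead, which keeps peak memory close to that of inference and is what makes forward-only training plausible on an 8GB GPU or a smartphone.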
Takeaways, Limitations
•
Takeaways:
◦
ZeroQAT performs end-to-end QAT without backpropagation, enabling quantization of LLMs even in memory-constrained environments.
◦
Even at extremely low bit widths such as 2-4 bits, a 13B model can be fine-tuned on a single 8GB GPU.
◦
Fine-tuning LLMs is shown to be feasible even in resource-constrained environments such as smartphones.
•
Limitations:
◦
Specific limitations are not stated in the paper (judging solely from the content of the abstract).