Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries on this page are generated using Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost

Created by
  • Haebom

Author

Qitao Tan, Xiaoying Song, Jin Lu, Guoming Li, Jun Liu, Lingzi Hong, Caiwen Ding, Jundong Li, Xiaoming Zhai, Shaoyi Huang, Wei Niu, Geng Yuan

Outline

To address the limitations of post-training quantization (PTQ) and the high memory overhead of quantization-aware training (QAT), the two main quantization approaches for reducing the deployment cost of large language models (LLMs), we propose ZeroQAT, a QAT framework based on zeroth-order optimization. ZeroQAT eliminates backpropagation, reducing computational and memory overhead while retaining the benefits of end-to-end optimization. We also introduce a lightweight variant of ZeroQAT for quantized fine-tuning, further reducing memory usage. Experimental results show that ZeroQAT outperforms leading PTQ and QAT baselines while requiring significantly less memory; for example, it can fine-tune a 13B model on a single 8GB GPU and a 6.7B model on a OnePlus 12 smartphone.
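The core idea is that weight gradients are estimated from forward passes alone, so no activation storage or backward pass is needed. Below is a minimal sketch of what a zeroth-order QAT step could look like, assuming a simple SPSA-style two-point gradient estimator and uniform fake quantization in the forward pass; the function names and hyperparameters (fake_quantize, loss_fn, mu, lr) are illustrative placeholders, not the paper's actual implementation.

```python
# Illustrative sketch of a zeroth-order (SPSA-style) QAT step.
# Gradients are estimated from two forward passes; no backpropagation is used.
import torch

def fake_quantize(w, bits=4):
    # Uniform symmetric fake quantization (illustrative only).
    qmax = 2 ** (bits - 1) - 1
    w_max = w.abs().max()
    scale = w_max / qmax if w_max > 0 else 1.0
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

@torch.no_grad()  # no backward pass: the update relies only on forward evaluations
def zeroth_order_qat_step(model, loss_fn, batch, mu=1e-3, lr=1e-5, bits=4):
    params = [p for p in model.parameters() if p.requires_grad]
    # One random perturbation direction per parameter tensor.
    zs = [torch.randn_like(p) for p in params]

    def perturbed_loss(sign):
        # Shift weights along the sampled direction.
        for p, z in zip(params, zs):
            p.add_(sign * mu * z)
        # Evaluate the loss with fake-quantized weights, then restore them.
        originals = [p.clone() for p in params]
        for p in params:
            p.copy_(fake_quantize(p, bits))
        loss = loss_fn(model, batch)
        for p, o in zip(params, originals):
            p.copy_(o)
        # Undo the perturbation.
        for p, z in zip(params, zs):
            p.sub_(sign * mu * z)
        return loss

    # Two-point estimate of the directional derivative.
    g = (perturbed_loss(+1.0) - perturbed_loss(-1.0)) / (2 * mu)
    # Update along the sampled direction.
    for p, z in zip(params, zs):
        p.sub_(lr * g * z)
```

Because only forward passes are required, the peak memory footprint is close to that of inference, which is what allows fine-tuning large quantized models on a single consumer GPU or a smartphone.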

Takeaways, Limitations

Takeaways:
ZeroQAT performs end-to-end QAT without backpropagation, enabling quantization of LLMs even in memory-constrained environments.
Even at extremely low bit widths (2-4 bits), a 13B model can be fine-tuned on a single 8GB GPU.
Fine-tuning of LLMs is demonstrated to be feasible even in resource-constrained environments such as smartphones.
Limitations:
No specific limitations are stated in the paper (judging solely from the abstract).