Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

MoQE: Improve Quantization Model performance via Mixture of Quantization Experts

Created by
  • Haebom

Author

Jinhao Zhang, Yunquan Zhang, Boyang Zhang, Zeyu Liu, Daning Cheng

Outline

MoQE proposes a quantization inference framework based on the Mixture of Experts (MoE) architecture to improve model efficiency and reduce deployment costs. MoQE combines multiple quantized variants of a model into specialized "quantization experts" and dynamically routes each input to the most suitable expert based on its characteristics. Experiments on the ImageNet, WikiText, C4, and OpenWebText datasets with ResNet, LLaMA, and Qwen models demonstrate that MoQE achieves performance comparable to state-of-the-art quantization models without significantly increasing inference latency.
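The dispatch idea can be sketched roughly as follows. This is a minimal illustration under assumptions, not the authors' implementation: the TinyRouter module, the flattened routing features, and the toy linear "experts" standing in for differently quantized model variants are all hypothetical stand-ins.

```python
import torch
import torch.nn as nn


class TinyRouter(nn.Module):
    """Lightweight router that scores each quantization expert for an input."""

    def __init__(self, feat_dim: int, num_experts: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_experts),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Return the index of the highest-scoring expert for each sample.
        return self.scorer(feats).argmax(dim=-1)


class MoQE(nn.Module):
    """Dispatches each input to the quantized expert chosen by the router."""

    def __init__(self, experts, router):
        super().__init__()
        self.experts = nn.ModuleList(experts)  # e.g. 4-bit / 8-bit variants
        self.router = router

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Cheap per-sample features for routing; the paper's structure-aware
        # features for CV/NLP inputs would replace this simple flattening.
        feats = x.flatten(1).float()
        choice = self.router(feats)
        outputs = []
        for i, sample in enumerate(x):
            expert = self.experts[choice[i].item()]
            outputs.append(expert(sample.unsqueeze(0)))
        return torch.cat(outputs, dim=0)


# Toy usage: two linear "experts" stand in for differently quantized backbones.
experts = [nn.Linear(32, 10) for _ in range(2)]
router = TinyRouter(feat_dim=32, num_experts=2)
model = MoQE(experts, router)
logits = model(torch.randn(8, 32))  # (batch=8, features=32) -> (8, 10)
```

The router only adds a small per-input scoring cost, which is consistent with the paper's claim that inference latency does not increase significantly.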

Takeaways, Limitations

Takeaways:
The MoE architecture is applied to quantization, mitigating the performance degradation of single quantized models.
A lightweight, structure-aware router is designed for both CV and NLP tasks.
MoQE achieves performance comparable to SOTA quantization models.
Performance improves without a significant increase in inference latency.
Limitations:
The paper does not explicitly discuss its limitations.