Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization

Created by
  • Haebom

Author

JiangYong Yu, Sifan Zhou, Dawei Yang, Shuo Wang, Shuoyu Li, Xing Hu, Chen Xu, Zukang Xu, Changyong Shu, Zhihang Yuan

Outline

This paper proposes MQuant, a post-training quantization (PTQ) framework for efficient inference of multimodal large language models (MLLMs). To address the barriers to practical deployment posed by the large parameter counts and high computational demands of MLLMs, MQuant introduces Modality-Specific Static Quantization (MSQ), Attention-Invariant Flexible Switching (AIFS), and Rotation Magnitude Suppression (RMS), achieving superior performance over existing PTQ baselines. MSQ assigns separate static scales to visual and textual tokens. AIFS rearranges the token order to eliminate computationally expensive per-token scale calculations while preserving causal attention. RMS mitigates weight outliers caused by online Hadamard rotations. Under W4A8, MQuant reduces inference latency by up to 30% on five leading MLLMs, including Qwen-VL, MiniCPM-V, and CogVLM2, while maintaining near-floating-point accuracy (<1% degradation). The source code is available on GitHub.
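A minimal sketch of the MSQ idea above, in plain NumPy (not the authors' code; the calibration data, shapes, and symmetric-scale rule are simplifying assumptions): activations are quantized to INT8 with one static scale pre-computed offline for visual tokens and another for text tokens, so no per-token scale needs to be computed at inference time.

```python
import numpy as np

def static_scale(calib_acts: np.ndarray, n_bits: int = 8) -> float:
    """Derive one static (offline) symmetric scale from calibration activations."""
    qmax = 2 ** (n_bits - 1) - 1          # 127 for INT8
    return np.abs(calib_acts).max() / qmax

def quantize(x: np.ndarray, scale: float, n_bits: int = 8) -> np.ndarray:
    qmax = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# --- toy calibration data (hypothetical): visual tokens span a wider range ---
rng = np.random.default_rng(0)
calib_visual = rng.normal(0.0, 4.0, size=(256, 64)).astype(np.float32)
calib_text   = rng.normal(0.0, 1.0, size=(256, 64)).astype(np.float32)

# One static scale per modality, computed once offline (the MSQ idea).
scale_visual = static_scale(calib_visual)
scale_text   = static_scale(calib_text)

# --- inference-time activations, tagged by modality ---
acts_visual = rng.normal(0.0, 4.0, size=(32, 64)).astype(np.float32)
acts_text   = rng.normal(0.0, 1.0, size=(48, 64)).astype(np.float32)

# No per-token scale computation here: each token reuses its modality's scale.
deq_visual = dequantize(quantize(acts_visual, scale_visual), scale_visual)
deq_text   = dequantize(quantize(acts_text,   scale_text),   scale_text)

print("visual MSE:", float(np.mean((acts_visual - deq_visual) ** 2)))
print("text   MSE:", float(np.mean((acts_text   - deq_text)   ** 2)))
```

A single shared static scale would be dominated by the wider visual range and waste INT8 resolution on the text tokens, which is the visual/text distribution mismatch MSQ targets; per the summary, AIFS then rearranges the token order so this per-modality scheme avoids expensive per-token scale switching while keeping causal attention intact.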

Takeaways, Limitations

Takeaways:
A new PTQ framework, MQuant, is presented for efficient MLLM inference.
Addresses the high inference latency caused by per-token dynamic quantization in existing PTQ, the distribution mismatch between visual and text tokens, and the outlier issues caused by the Hadamard transform (a short sketch after this list reproduces the Hadamard-induced outlier effect).
Achieves near-floating-point accuracy with reduced inference latency (up to 30%) across a variety of MLLMs.
Increases the practicality of MLLM inference in resource-constrained environments.
Ensures reproducibility and extensibility of the research through release of the source code.
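The Hadamard-related weight outliers mentioned above can be illustrated in a few lines of NumPy. The sketch below is my own illustration, not the authors' experiment; the matrix size, the non-zero channel mean, and the first-column mechanism are assumptions about how an online Hadamard rotation can produce the kind of weight outlier the summary says RMS mitigates.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

d = 256
rng = np.random.default_rng(0)

# Hypothetical weight matrix whose channels have a small non-zero mean.
W = rng.normal(loc=0.05, scale=0.02, size=(d, d))

# Orthonormal online Hadamard rotation, as used in rotation-based PTQ pipelines.
H = hadamard(d) / np.sqrt(d)
W_rot = W @ H

print("max |w| before rotation:", np.abs(W).max())
print("max |w| after  rotation:", np.abs(W_rot).max())
# The first Hadamard column is all +1/sqrt(d), so it accumulates each row's mean:
# (W @ H)[:, 0] = sqrt(d) * row_mean(W), which surfaces as a weight outlier that
# the quantizer then has to absorb.
print("max |w| in first rotated column:", np.abs(W_rot[:, 0]).max())
```

Running this, the rotated weights show a much larger maximum magnitude concentrated in the first column, even though the rotation is orthonormal; suppressing such rotation-induced outliers is the role the summary attributes to RMS.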
Limitations:
The effectiveness of the proposed method may be limited to specific MLLMs and the W4A8 quantization setting; further research is needed to determine how well it generalizes to other MLLMs and quantization settings.
The range of MLLMs currently supported is limited, and applicability to a broader set of models remains to be verified.
Because the method is specialized for MLLMs that rely on the Hadamard transform, it may be difficult to apply to MLLMs with other architectures.