Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

On Jailbreaking Quantized Language Models Through Fault Injection Attacks

Created by
  • Haebom

Author

Noeldin Zahran, Ahmad Tahmasivand, Ihsen Alouani, Khaled Khasawneh, Mohammed E. Fouda

Outline

This paper investigates the effectiveness of parameter manipulation attacks (e.g., fault injection) against large language models (LLMs) whose efficiency has been improved through low-precision quantization. The authors propose gradient-based attacks, namely a bit-wise search algorithm and a word-wise attack, and evaluate them on Llama-3.2-3B, Phi-4-mini, and Llama-3-8B under FP16 (baseline), FP8, INT8, and INT4 quantization schemes. The experiments show that attack success rates vary significantly with the quantization scheme: attacks succeed at a high rate on the FP16 models but at a much lower rate when mounted directly on FP8 and INT8 models. However, an attack that succeeds on an FP16 model largely retains its high success rate when the attacked model is subsequently quantized to FP8 or INT8, whereas quantizing to INT4 sharply reduces it. This suggests that although common quantization formats such as FP8 raise the difficulty of direct parameter manipulation attacks, vulnerabilities can persist, particularly via post-attack quantization.
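The bit-wise search is easiest to picture as a gradient-guided ranking of which weight bits to fault. Below is a minimal PyTorch sketch of that idea, not the authors' exact algorithm: it scores weights by the gradient magnitude of an attack loss and simulates a fault by flipping the sign bit of a chosen FP16 weight. The function names (`rank_flip_targets`, `flip_sign_bit_fp16`) and the choice of the sign bit are illustrative assumptions.

```python
import torch

def rank_flip_targets(model, attack_loss, k=10):
    """Rank individual weights by |d(attack_loss)/dw| as fault targets.

    Hypothetical helper: weights with the largest gradient magnitude are
    where a single bit flip is most likely to move the model's output.
    """
    model.zero_grad()
    attack_loss.backward()
    scored = []
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.detach().abs().flatten()
        vals, idxs = torch.topk(g, min(k, g.numel()))
        scored.extend((v.item(), name, int(i)) for v, i in zip(vals, idxs))
    scored.sort(key=lambda t: t[0], reverse=True)  # most sensitive first
    return scored[:k]

@torch.no_grad()
def flip_sign_bit_fp16(param, flat_index):
    """Simulate fault injection: flip the sign bit of one FP16 weight."""
    assert param.dtype == torch.float16
    flat = param.data.view(-1)
    raw = flat[flat_index].view(torch.int16)          # reinterpret raw bits
    mask = torch.tensor(-0x8000, dtype=torch.int16)   # MSB = FP16 sign bit
    flat[flat_index] = (raw ^ mask).view(torch.float16)
```

A full attack would presumably iterate: flip the top candidate, re-measure the jailbreak objective, and keep or revert the flip before moving to the next candidate.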

Takeaways and Limitations

Takeaways:
  • Low-precision quantization (especially FP8) is shown to be effective at defending large language models against direct parameter manipulation attacks.
  • Both the attack success rate and the location of vulnerable parameters vary with the quantization method.
  • An attack that succeeds on FP16 retains much of its success rate after quantization to FP8/INT8, but quantization to INT4 reduces it sharply (see the sketch after this list).
  • Security should be considered when designing a model quantization strategy.
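The post-attack quantization finding can be checked with a simple pipeline: attack the FP16 model, quantize the attacked weights, and re-measure the attack success rate. The sketch below simulates INT8 with symmetric per-tensor rounding rather than a production kernel, and `is_jailbroken` stands in for whatever success judge the evaluation uses; both are assumptions, not the paper's code.

```python
import copy
import torch

@torch.no_grad()
def post_quantization_success_rate(attacked_fp16_model, prompts, is_jailbroken):
    """Quantize an already-attacked FP16 model and re-measure attack success.

    `is_jailbroken(model, prompt) -> bool` is a hypothetical judge of
    whether a harmful prompt bypasses the model's safety behavior.
    """
    def fake_quant_int8(model):
        # Symmetric per-tensor INT8 round-trip: w -> round(w / s) -> w_hat.
        q = copy.deepcopy(model)
        for p in q.parameters():
            scale = p.abs().max().item() / 127.0
            if scale > 0:
                p.copy_((p / scale).round().clamp(-128, 127) * scale)
        return q

    quantized = fake_quant_int8(attacked_fp16_model)
    hits = sum(is_jailbroken(quantized, prompt) for prompt in prompts)
    return hits / len(prompts)  # fraction of prompts still jailbreaking
```

If this rate stays near the FP16 rate under 8-bit rounding but collapses under a 4-bit variant, that would reproduce the qualitative pattern the paper reports.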
Limitations:
The evaluation covers only a limited set of models and quantization methods.
Generalizability to other attack types is limited.
Metrics beyond attack success rate (e.g., attack complexity) are not analyzed.