Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Prefill-level Jailbreak: A Black-Box Risk Analysis of Large Language Models

Created by
  • Haebom

Author

Yakai Li, Jiekang Hu, Weiduan Sang, Luping Ma, Dongsheng Nie, Weijuan Zhang, Aimin Yu, Yi Su, Qingjia Huang, Qihang Zhou

Outline

This paper presents the results of a study on replay attacks, a security threat to large-scale language models (LLMs), focusing on attacks that exploit the user-controlled response prefill feature, rather than the prompt-level attacks primarily addressed in previous studies. Prefill allows attackers to manipulate the beginning of the model output, shifting the attack paradigm from persuasion-based attacks to direct manipulation of model state. Black-box security analysis was performed on 14 LLMs to classify replay attacks at the prefill level and evaluate their effectiveness. Experimental results show that attacks using adaptive methods achieved success rates exceeding 99% across multiple models, and token-level probability analysis confirmed that initial state manipulation caused a shift in the first token probability from rejection to cooperation. Furthermore, we demonstrate that replay attacks at the prefill level effectively enhance the success rate of existing prompt-level attacks by 10-15 percentage points. Evaluation of several defense strategies revealed that existing content filters offer limited protection, and that detection methods focusing on the operational relationship between prompts and prefill are more effective. In conclusion, we expose vulnerabilities in the current LLM security alignment and emphasize the need to address pre-fill attack surfaces in future security training.

Takeaways, Limitations

Takeaways:
We reveal the existence and severity of a new type of re-break attack that leverages user-controlled response prefill functionality.
We show that prefilling attacks can amplify existing prompt-based attacks.
It exposes the limitations of existing content filters and suggests the need for a new detection method based on the relationship between prompts and prefills.
Suggesting research directions for improving LLM security (responding to prefill attacks).
Limitations:
Limits on the types and number of models to be analyzed (14 models).
Further research is needed to determine the generalizability of the proposed detection method and its application to real-world environments.
A comprehensive analysis of the different types of pre-filling attacks may be lacking.
👍