This page curates AI-related papers from around the world. All content is summarized using Google Gemini, and the site is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Created by
Haebom
Authors
Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell
Outline
This paper studies how risk and capability evaluations of large language models (LLMs) can be integrated into AI risk-management and governance frameworks. We point out the limitations of existing input-output evaluations, which cannot realistically cover the full risk surface and only establish lower bounds on worst-case input-output behavior, and propose complementary evaluations based on model tampering attacks that modify latent activations or weights. Evaluating state-of-the-art techniques for removing harmful LLM capabilities against five input-space attacks and six model tampering attacks, we show that model robustness lies on a low-dimensional robustness subspace and that the success rate of model tampering attacks provides a conservative estimate of the success rate of held-out input-space attacks. We also show that state-of-the-art unlearning methods can be undone within 16 steps of fine-tuning. In conclusion, we highlight the difficulty of suppressing harmful LLM capabilities and show that model tampering attacks enable much more rigorous evaluations than input-space attacks alone.
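For illustration, below is a minimal, hypothetical sketch (not code from the paper) of the two tampering attack families described above: perturbing latent activations with a forward hook, and a few-step fine-tuning attack on held-out data. The model name, layer index, steering direction, and fine-tuning texts are placeholder assumptions, and the module path `model.model.layers` assumes a Llama-style architecture.
```python
# Hypothetical sketch of model tampering attacks: latent-activation perturbation
# and a short fine-tuning attack. All names and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "org/unlearned-llm"  # placeholder: a checkpoint with "unlearned" capabilities
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# (1) Latent-space attack: add a steering vector to one decoder layer's output.
steering_vec = 0.1 * torch.randn(model.config.hidden_size)  # placeholder direction

def perturb(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, hidden)
    hidden = hidden + steering_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

layer_idx = 12  # placeholder layer; assumes a Llama-style model.model.layers list
hook = model.model.layers[layer_idx].register_forward_hook(perturb)
# ... run harmful-capability evaluation prompts through the perturbed model here ...
hook.remove()

# (2) Weight-space attack: a handful of fine-tuning steps on held-out "forget" data,
# mirroring the 16-step budget mentioned in the outline.
finetune_texts = ["<held-out forget-set example>"] * 16  # placeholder data
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for text in finetune_texts:
    batch = tok(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```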
Takeaways, Limitations
•
Takeaways:
◦
We present a novel method for assessing LLM risks more rigorously through model tampering attacks.
◦
Model tampering attack success rates can be used to conservatively estimate the success rates of held-out input-space attacks.
◦
We demonstrate the vulnerability of state-of-the-art unlearning techniques, highlighting the difficulty of securing LLMs against harmful capabilities.
◦
We show that LLM robustness lies on a low-dimensional robustness subspace.
•
Limitations:
◦
Further research is needed to determine how well the proposed model tampering attack evaluations generalize.
◦
Experiments with a wider variety of LLMs and attack techniques are needed.
◦
The real-world applicability and ethical implications of model tampering attacks are not discussed in depth.