Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Adversarial Manipulation of Reasoning Models using Internal Representations

Created by
  • Haebom

Author

Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi

Outline

This paper presents research on the vulnerability of an inference model that generates Chain of Thought (CoT) tokens to jailbreak attacks. Unlike existing language models, which make rejection decisions at the prompt-response boundary, we found evidence that the DeepSeek-R1-Distill-Llama-8B model makes rejection decisions within the CoT generation process. We identified a linear direction (attention direction) in the activation space during CoT token generation that predicts whether the model will reject or accept. This direction corresponds to a pattern of deliberate inference in the generated text. Removing this direction from the model activations increases harmful acceptance, effectively enabling the model to be jailbroken. We also demonstrate that the final output can be controlled by manipulating CoT token activations alone, and incorporating this direction into a prompt-based attack improves the success rate. Consequently, our findings suggest that the chain of thoughts itself represents a promising new target for adversarial manipulation of inference models.

Takeaways, Limitations

Takeaways:
We reveal that the chain of events (CoT) generation process is vulnerable to jailbreaking attacks on the inference model.
We show that the "attention" direction that influences the model's rejection/acceptance decision can be identified in the activation space and manipulated to control the model's output.
It suggests the possibility of controlling the final output by simply manipulating the CoT token activation.
We show that incorporating an “attention” direction into prompt-based attacks can increase their success rate.
Suggesting that the thought chain itself could become a new target for adversarial attacks on inference models.
Limitations:
Since this is a study on a specific model (DeepSeek-R1-Distill-Llama-8B), the generalizability of the results to other models is limited.
Further analysis of the exact mechanism and internal workings of the model in the "attention" direction is needed.
Further research is needed to determine the real-world applicability and risks of the proposed attack techniques.
👍