Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

Created by
  • Haebom

Author

Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, George Turkiyyah, Bernard Ghanem

Outline

This paper proposes Rank-One Safety Injection (ROSI), a novel method for improving the safety of large language models (LLMs). ROSI is a simple, rank-one weight modification that permanently steers model activations into the refusal-mediating subspace, without requiring any fine-tuning. The required safety direction is computed from a small set of harmful/harmless instruction pairs and applied to all matrices that write to the residual stream. In evaluations using Llama Guard 3, ROSI consistently raises safe-refusal rates while preserving the model's utility. The authors further show that ROSI can re-align "uncensored" models by amplifying their latent safety directions, demonstrating its value as an effective last-step safety procedure. Overall, targeted, interpretable weight steering is an inexpensive yet powerful mechanism for improving LLM safety, complementing more resource-intensive fine-tuning paradigms.
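The core mechanics described above (a contrastive safety direction plus a rank-one update to residual-stream write matrices) can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the activation data is random, and the names (`rank_one_inject`), shapes, and the scaling factor `alpha` are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # toy residual-stream width

# Toy residual-stream activations for harmful vs. harmless prompts
# (in practice these would come from forward passes over a small
# contrastive set of instructions).
harmful_acts = rng.normal(size=(16, d_model)) + 2.0
harmless_acts = rng.normal(size=(16, d_model))

# Safety (refusal) direction: normalized difference of mean activations.
direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def rank_one_inject(W, d, alpha=0.1):
    """Rank-one update to a residual-stream write matrix W
    (shape: d_model x d_in), boosting the component of its
    output that lies along the safety direction d."""
    return W + alpha * np.outer(d, d @ W)

# Apply the injection to one toy write matrix.
W = rng.normal(size=(d_model, 4))
W_new = rank_one_inject(W, direction)

# The modification is rank one: W_new - W = alpha * outer(d, d @ W).
print(np.linalg.matrix_rank(W_new - W))  # 1
```

Because the update is a single outer product, it adds negligible storage and compute, which is what makes this kind of weight steering cheap compared with fine-tuning.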

Takeaways, Limitations

Takeaways:
  • Presents ROSI as an inexpensive and effective method for improving LLM safety.
  • Raises safe-refusal rates while preserving model usability, without any fine-tuning.
  • Suggests that "uncensored" models can be re-aligned toward safety.
  • Demonstrates the utility of targeted, interpretable weight steering.
  • Can serve as a complement to existing fine-tuning-based methods.
Limitations:
  • Further study is needed on the long-term safety and generalizability of ROSI.
  • Applicability to diverse LLM architectures and safety mechanisms remains to be verified.
  • The selection criteria and quality of the harmful/harmless instruction pairs used to compute the safety direction need further research.
  • ROSI's robustness against real-world adversarial attacks has yet to be evaluated.