Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

SAFER: Probing Safety in Reward Models with Sparse Autoencoder

Created by
  • Haebom

Author

Sihang Li, Wei Shi, Ziyuan Xie, Tao Liang, Guojun Ma, Xiang Wang

Outline

Reinforcement Learning from Human Feedback (RLHF) is a key paradigm for aligning large language models (LLMs) with human values, but the reward models at its core remain largely opaque. In this paper, we present Sparse Autoencoder for Enhanced Reward Models (SAFER), a novel framework for interpreting and improving reward models through mechanistic analysis. Leveraging sparse autoencoders (SAEs), we identify human-interpretable features in reward model activations, providing insight into safety-related decision-making. We apply SAFER to a safety-oriented preference dataset and quantify the importance of individual features through activation differences between chosen and rejected responses. Using these feature-level signals, we design targeted data poisoning and denoising strategies. Experimental results demonstrate that SAFER can precisely degrade or enhance safety alignment with minimal data modification, without sacrificing general chat performance. This approach contributes to interpreting, auditing, and refining reward models in critical LLM alignment tasks.
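
The core pipeline can be illustrated with a short, hypothetical sketch: fit a sparse autoencoder on reward model activations, then score each learned feature by its mean activation gap between chosen and rejected responses. The PyTorch code below is a minimal illustration of this idea only; the class and function names, hyperparameters, and activation-extraction plumbing are assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch: train a sparse autoencoder (SAE) on reward-model activations,
# then rank features by their activation difference between chosen and rejected
# responses. All names and hyperparameters here are hypothetical placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(f)           # reconstruction of the original activation
        return x_hat, f

def train_sae(acts: torch.Tensor, d_features: int, l1_coef: float = 1e-3,
              epochs: int = 10, lr: float = 1e-3) -> SparseAutoencoder:
    """Fit an SAE on a (num_examples, d_model) matrix of reward-model activations."""
    sae = SparseAutoencoder(acts.shape[1], d_features)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(epochs):
        x_hat, f = sae(acts)
        # Reconstruction loss plus an L1 penalty that encourages sparse features
        loss = ((x_hat - acts) ** 2).mean() + l1_coef * f.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae

@torch.no_grad()
def feature_importance(sae: SparseAutoencoder,
                       chosen_acts: torch.Tensor,
                       rejected_acts: torch.Tensor) -> torch.Tensor:
    """Score each SAE feature by its mean activation gap between chosen and rejected responses."""
    _, f_chosen = sae(chosen_acts)
    _, f_rejected = sae(rejected_acts)
    return f_chosen.mean(dim=0) - f_rejected.mean(dim=0)

# Usage sketch: activations would come from a hook on the reward model's hidden
# states for each (chosen, rejected) pair in a safety-oriented preference dataset.
# chosen_acts, rejected_acts: tensors of shape (num_pairs, d_model)
# sae = train_sae(torch.cat([chosen_acts, rejected_acts]), d_features=4096)
# scores = feature_importance(sae, chosen_acts, rejected_acts)
# top_safety_features = scores.abs().topk(20).indices  # candidates for data selection / denoising
```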

Takeaways, Limitations

Takeaways:
The SAFER framework contributes to understanding safety-related decision-making in reward models.
SAEs extract human-interpretable features from reward model activations.
Feature-level signals are used to design targeted strategies for manipulating safety alignment (a minimal sketch follows after this list).
Safety alignment can be enhanced or degraded with minimal data modification, without degrading general chat performance.
Limitations:
Specific limitations are not explicitly stated in the paper.
Because the topic concerns LLM safety, the paper may include discussion or examples of potentially harmful or unsafe outputs.
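
As a rough illustration of how the feature-level signals could drive the data-side strategies above, the hypothetical sketch below scores each preference pair on a chosen set of safety-relevant SAE features, then either drops inconsistent pairs (denoising) or flips labels on the most safety-relevant pairs (degradation). Function names and the dataset format are assumptions for exposition, not the paper's code.

```python
# Hypothetical sketch of feature-guided data manipulation on a preference dataset.
import torch

@torch.no_grad()
def pair_scores(sae, chosen_acts, rejected_acts, safety_features):
    """Per-pair safety relevance: activation gap on selected SAE features."""
    _, f_chosen = sae(chosen_acts)
    _, f_rejected = sae(rejected_acts)
    gap = f_chosen[:, safety_features] - f_rejected[:, safety_features]
    return gap.sum(dim=1)  # one score per preference pair

def denoise(dataset, scores, threshold=0.0):
    """Denoising strategy: drop pairs whose safety signal contradicts the preference label."""
    keep = scores > threshold
    return [pair for pair, k in zip(dataset, keep.tolist()) if k]

def poison(dataset, scores, k=100):
    """Degradation strategy: flip the labels of the k most safety-relevant pairs."""
    flip = set(scores.topk(k).indices.tolist())
    return [
        {"chosen": p["rejected"], "rejected": p["chosen"]} if i in flip else p
        for i, p in enumerate(dataset)
    ]
```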