Reinforcement Learning from Human Feedback (RLHF) is a key paradigm for aligning large language models (LLMs) with human values, yet the reward model at its core remains largely opaque. In this paper, we present Sparse Autoencoder for Enhanced Reward Models (SAFER), a novel framework for interpreting and improving reward models through mechanistic analysis. Leveraging sparse autoencoders (SAEs), we identify human-interpretable features in reward model activations, providing insight into safety-related decision-making. We apply SAFER to a safety-oriented preference dataset and quantify the importance of individual features via activation differences between chosen and rejected responses. Using these feature-level signals, we design targeted data manipulation and denoising strategies. Experimental results demonstrate that SAFER can precisely degrade or enhance safety alignment with minimal data modification, without sacrificing general chat performance. This approach contributes to interpreting, auditing, and improving reward models in critical LLM alignment tasks.
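As a rough illustration of the feature-importance signal described above, the sketch below scores each SAE feature by the difference in its mean activation between chosen and rejected responses. This is a minimal sketch under assumed interfaces: the encoder form (a ReLU linear map), the function names (`sae_encode`, `feature_importance`), and the random placeholder data are all hypothetical and not the paper's actual implementation.

```python
# Minimal sketch: score SAE features by mean activation difference between
# chosen and rejected responses. All names and shapes here are illustrative
# assumptions, not the SAFER codebase.
import numpy as np

def sae_encode(activations: np.ndarray, W_enc: np.ndarray, b_enc: np.ndarray) -> np.ndarray:
    """Encode reward-model activations into sparse features: ReLU(x W_enc + b_enc)."""
    return np.maximum(activations @ W_enc + b_enc, 0.0)

def feature_importance(chosen_acts, rejected_acts, W_enc, b_enc):
    """Per-feature importance = mean SAE activation on chosen responses
    minus mean SAE activation on rejected responses."""
    f_chosen = sae_encode(chosen_acts, W_enc, b_enc).mean(axis=0)
    f_rejected = sae_encode(rejected_acts, W_enc, b_enc).mean(axis=0)
    return f_chosen - f_rejected  # positive: feature fires more on chosen responses

# Example with random data standing in for real reward-model activations.
rng = np.random.default_rng(0)
d_model, d_sae, n_pairs = 64, 256, 100
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
chosen = rng.normal(size=(n_pairs, d_model))    # activations on chosen responses
rejected = rng.normal(size=(n_pairs, d_model))  # activations on rejected responses

scores = feature_importance(chosen, rejected, W_enc, b_enc)
top_features = np.argsort(-np.abs(scores))[:10]  # candidates for inspection or data manipulation
```

In this sketch, the highest-magnitude features would be the ones inspected for interpretability and used to select which preference pairs to modify or denoise.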