Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Do LLMs Understand the Safety of Their Inputs? Training-Free Moderation via Latent Prototypes

Created by
  • Haebom

Author

Maciej Chrabąszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzciński, Sebastian Cygert

Outline

This paper proposes a training-free safety assessment method that leverages the internal representations of pre-trained large language models (LLMs) instead of the traditionally expensive dedicated guard models to address LLM safety and alignment. The authors show that LLMs can recognize harmful inputs through simple prompting, and that safe and harmful prompts are separable in the model's latent space. Building on this, they propose the Latent Prototype Moderator (LPM), a lightweight, customizable add-on that uses the Mahalanobis distance in latent space to assess input safety. LPM generalizes across model families and sizes, and performs on par with or better than state-of-the-art guard models on several safety benchmarks.
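
The sketch below illustrates the general idea of prototype-based moderation with Mahalanobis distance: embed a small labeled calibration set of prompts in the LLM's latent space, estimate a prototype (mean) per class and a shared covariance, and label new prompts by their nearest prototype. The model name, last-token pooling of the final hidden layer, the shared regularized covariance, and all function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of latent-prototype moderation (assumptions noted above).
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B"  # hypothetical choice of backbone LLM

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, device_map="auto")
model.eval()

def embed(prompt: str) -> np.ndarray:
    """Use the final-layer hidden state of the last token as the prompt embedding (assumed pooling)."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden[0, -1].float().cpu().numpy()

def fit_prototypes(prompts, labels):
    """Estimate one mean per class ("safe"/"harmful") and a shared, regularized covariance."""
    X = np.stack([embed(p) for p in prompts])
    y = np.asarray(labels)
    protos = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    centered = np.concatenate([X[y == c] - protos[c] for c in protos])
    cov = np.cov(centered, rowvar=False) + 1e-3 * np.eye(X.shape[1])  # regularize for invertibility
    return protos, np.linalg.inv(cov)

def moderate(prompt, protos, cov_inv):
    """Assign the prompt the class of the nearest prototype under Mahalanobis distance."""
    x = embed(prompt)
    def dist(mu):
        d = x - mu
        return float(d @ cov_inv @ d)
    return min(protos, key=lambda c: dist(protos[c]))
```

Because fitting only requires class means and a covariance over frozen hidden states, no gradient updates to the LLM are needed, which is what makes this style of moderation training-free.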

Takeaways, Limitations

Takeaways:
Increases the efficiency of LLM moderation by providing a training-free alternative to traditional, costly guard models.
LPM offers a generalizable, flexible, and scalable solution that is independent of model family and size.
Demonstrates that simple prompting and latent-space analysis can be used to assess the safety of LLM inputs.
Matches or exceeds state-of-the-art performance across multiple safety benchmarks.
Limitations:
The performance of the proposed method may depend on the specific LLM used and on prompt engineering.
Adaptability to novel types of harmful inputs may require additional research.
Further research may be needed on the interpretability of the latent-space analysis.