Daily Arxiv

This page collects papers on artificial intelligence published worldwide.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper remains with its authors and their institutions; please cite the source when sharing.

Language Models Can Predict Their Own Behavior

Created by
  • Haebom

Authors

Dhananjay Ashok, Jonathan May

Outline

This paper presents a method for detecting and responding to specific behaviors (e.g., alignment failures) in a language model's (LM's) output during deployment. Although such behaviors can normally be identified only after the full output has been generated, the paper shows that a probe trained solely on the internal representations of the input tokens can predict the LM's behavior before a single token is generated. Specifically, conformal prediction is used to provide provable bounds on the probe's estimation error, yielding an early warning system that proactively flags alignment failures (jailbreaks) and instruction-following failures. The system reduces jailbreaks by 91%, and the same approach can predict the model's confidence level and the final prediction of an LM using Chain-of-Thought (CoT) prompting. Applied to a text classification LM using CoT, it cuts inference cost by 65% on average with negligible accuracy loss. The approach also generalizes to unseen datasets, and its performance improves with model scale, suggesting applicability to large models in real-world settings.
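
To make the pipeline concrete, here is a minimal sketch, not the authors' released code: a linear probe is trained on one hidden-state vector per prompt (e.g., the activations of the final input token), and split conformal prediction calibrates a threshold so the warning decision carries a provable marginal error bound. The synthetic data, label semantics, and function names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in data: one hidden-state vector per prompt (e.g., the final input
# token's activations) and a binary behavior label (1 = jailbreak occurs).
# In practice these would come from your own model and labeled prompts.
d = 64
X = rng.normal(size=(600, d))
w = rng.normal(size=d)
y = (X @ w + rng.normal(size=600) > 0).astype(int)
X_train, y_train, X_cal, y_cal = X[:400], y[:400], X[400:], y[400:]

# Linear probe over the frozen hidden states.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Split conformal calibration: nonconformity = probability mass the probe
# did NOT put on the true label; qhat is the ceil((n+1)(1-alpha))/n quantile.
alpha = 0.05
cal_probs = probe.predict_proba(X_cal)
scores = 1.0 - cal_probs[np.arange(len(y_cal)), y_cal]
n = len(scores)
qhat = np.quantile(scores, min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0),
                   method="higher")

def prediction_set(x):
    """Labels whose nonconformity is at most qhat; contains the true label
    with probability >= 1 - alpha (marginal coverage guarantee)."""
    p = probe.predict_proba(x.reshape(1, -1))[0]
    return {label for label in (0, 1) if 1.0 - p[label] <= qhat}

def early_warning(x):
    """Flag the prompt before any token is generated if the unsafe
    behavior (label 1) cannot be ruled out at the chosen error level."""
    return 1 in prediction_set(x)

print(early_warning(X_cal[0]))
```

Because the threshold comes from conformal calibration rather than a hand-tuned cutoff, the flagging rate trades off against a coverage guarantee controlled by alpha, which is what lets the warning system's error be bounded in advance.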

Takeaways, Limitations

Takeaways:
  • An early warning system that predicts an LM's behavior from the internal representations of input tokens alone.
  • Fewer alignment failures (jailbreaks, reduced by 91%) and instruction-following failures.
  • The model's confidence level can be predicted in advance.
  • Reduced inference cost for LMs using CoT prompting (by 65% on average); a sketch of this early-exit logic follows the list.
  • Generalizes to unseen datasets and larger models.
Limitations:
  • Further research is needed on practical deployment of the method.
  • Generalization should be evaluated across more LM types and prompting techniques.
  • The probe's accuracy and reliability need further improvement.
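
As promised above, here is a minimal sketch of the CoT cost-reduction idea (not the authors' implementation), reusing `prediction_set` from the earlier example: when the conformal prediction set is a singleton, the probe's answer is returned directly and the expensive chain-of-thought generation is skipped; otherwise the LM runs as usual. `run_cot` is a hypothetical callable standing in for full CoT inference.

```python
def classify(x, run_cot):
    """x: input-token hidden state; run_cot: expensive fallback that
    generates the full chain of thought and returns the final label."""
    s = prediction_set(x)   # conformal set from the input-token probe
    if len(s) == 1:         # probe is confident within its error bound:
        return s.pop()      # answer directly and skip CoT generation
    return run_cot(x)       # ambiguous: pay for the full chain of thought
```

The reported 65% average cost saving corresponds to the fraction of inputs resolved on the cheap branch, while the conformal guarantee keeps the accuracy loss on those inputs negligible.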