This paper presents a method for detecting and responding to specific behaviors (e.g., failures of alignment training) in the output text of a language model (LM) during deployment. Although such behaviors can only be confirmed after the full output has been generated, we show that a detector (probe) trained only on the internal representations of the input tokens can predict the LM's behavior before a single output token is produced. Specifically, we apply conformal prediction to obtain provable bounds on the detector's estimation error and use these bounds to build an early warning system that proactively flags alignment failures (jailbreaks) and instruction-following failures. The system reduces jailbreaks by 91% and can also predict the model's confidence and the final answer of an LM that uses Chain-of-Thought (CoT) prompting. When applied to a CoT text-classification LM, it yields an average 65% reduction in inference cost with negligible loss of accuracy. Furthermore, the approach generalizes to unseen datasets, and its performance improves with model size, suggesting applicability to large-scale models in real-world settings.
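To make the pipeline concrete, the sketch below illustrates one plausible instantiation under assumptions not fixed by the abstract: the probe is a logistic regression over the hidden state of the last input token, the calibration uses split conformal prediction to bound the probe's miss rate on behavior-inducing prompts at level alpha, and the names (`probe`, `early_warning`, the synthetic features) are illustrative placeholders rather than the paper's actual components.

```python
# Minimal sketch (not the paper's exact method): a linear probe on prompt
# hidden states plus split-conformal calibration of an early-warning threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for hidden states extracted from the LM for each prompt
# (shape: [num_prompts, hidden_dim]); label 1 = the behavior (e.g., a
# jailbreak) would occur in the completion, 0 = benign completion.
d = 64
X_train = rng.normal(size=(2000, d)) + rng.integers(0, 2, size=(2000, 1))
y_train = (X_train.mean(axis=1) > 0.5).astype(int)
X_cal   = rng.normal(size=(500, d))  + rng.integers(0, 2, size=(500, 1))
y_cal   = (X_cal.mean(axis=1) > 0.5).astype(int)

# 1) Train the probe on (prompt hidden state, eventual behavior) pairs.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 2) Split-conformal calibration: pick a threshold so that, on exchangeable
#    new positives, the probability of missing the behavior is at most alpha.
alpha = 0.1
pos_scores = np.sort(probe.predict_proba(X_cal[y_cal == 1])[:, 1])
n = len(pos_scores)
k = int(np.floor(alpha * (n + 1)))  # finite-sample corrected quantile index
threshold = pos_scores[k - 1] if k >= 1 else -np.inf  # flag everything if n too small

def early_warning(hidden_state: np.ndarray) -> bool:
    """Flag the prompt before generation if the probe score exceeds the
    calibrated threshold (e.g., to refuse or reroute the request)."""
    score = probe.predict_proba(hidden_state.reshape(1, -1))[0, 1]
    return score >= threshold

print("calibrated threshold:", threshold)
print("flag example prompt:", early_warning(X_cal[0]))
```

The same pattern extends to the other applications mentioned above: a probe calibrated on prompt representations could, for instance, decide whether to skip an expensive CoT generation when it already predicts the final classification with high confidence.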