This paper studies subliminal learning, a surprising phenomenon in which language models transmit behavioral traits through semantically unrelated data. In a key experiment, a “teacher” model with a trait T, such as liking owls or being misaligned, generates a dataset consisting only of number sequences, and a “student” model fine-tuned on this dataset acquires the trait T. The effect persists even when the data is filtered to remove all references to T, and the same transfer is observed when the teacher instead generates code or chain-of-thought reasoning traces. However, the effect is not observed when the teacher and student are built on different base models. To explain these results, the researchers prove that subliminal learning occurs in all neural networks under certain conditions, notably when teacher and student share the same initialization, and they demonstrate the effect in a simple MLP classifier. They conclude that subliminal learning is a general phenomenon that poses an unexpected risk for AI development: unintended traits can propagate through knowledge distillation even when developers attempt to prevent this by filtering the training data.
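To make the filtering step concrete, here is a minimal sketch, not the paper's actual pipeline: the function names and the strictness of the check are assumptions. It keeps only teacher completions that are strictly digit sequences, so no lexical trace of the trait T can survive; the paper's finding is that the trait transfers anyway.

```python
import re

def is_pure_number_sequence(completion: str) -> bool:
    """Accept only completions that are digit runs separated by
    commas, semicolons, or whitespace. Anything else (words, symbols,
    stray tokens) is discarded, so no lexical reference to the
    trait T can pass the filter."""
    return re.fullmatch(r"\s*\d+(\s*[,;\s]\s*\d+)*\s*", completion) is not None

def filter_dataset(completions: list[str]) -> list[str]:
    # Keep only strictly numeric sequences for student fine-tuning.
    return [c for c in completions if is_pure_number_sequence(c)]

# Example: a completion survives only if it is pure digits.
print(filter_dataset(["3, 14, 159, 26", "owls: 1, 2, 3", "42 7 99"]))
# -> ['3, 14, 159, 26', '42 7 99']
```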
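The MLP demonstration can likewise be sketched in a few lines of PyTorch. This is not the authors' code: the architecture, noise distribution, and hyperparameters are illustrative assumptions. A teacher and a student start from the same initialization; the teacher is trained on MNIST, and the student is then distilled only on the teacher's logits for random noise inputs, data that carries no information about digits, yet it picks up nontrivial MNIST accuracy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def make_mlp() -> nn.Module:
    # Small MLP classifier; the architecture is an illustrative assumption.
    return nn.Sequential(nn.Flatten(),
                         nn.Linear(784, 256), nn.ReLU(),
                         nn.Linear(256, 10))

torch.manual_seed(0)
teacher = make_mlp()
student = make_mlp()
# Shared initialization: the condition under which the theorem applies.
student.load_state_dict(teacher.state_dict())

# 1) Train the teacher normally on MNIST labels (one epoch for brevity).
mnist = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
for x, y in DataLoader(mnist, batch_size=128, shuffle=True):
    opt.zero_grad()
    F.cross_entropy(teacher(x), y).backward()
    opt.step()

# 2) Distill the student on teacher logits for pure noise inputs:
#    the training data carries no semantic information about digits.
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(2000):
    noise = torch.rand(128, 1, 28, 28)  # inputs unrelated to MNIST
    with torch.no_grad():
        t_logits = teacher(noise)
    loss = F.kl_div(F.log_softmax(student(noise), dim=-1),
                    F.softmax(t_logits, dim=-1), reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()

# 3) The student never saw a digit or a label, yet it should recover
#    substantial MNIST accuracy (the subliminal-learning effect).
test = datasets.MNIST("data", train=False, download=True,
                      transform=transforms.ToTensor())
x_test = test.data.unsqueeze(1).float() / 255.0
with torch.no_grad():
    acc = (student(x_test).argmax(dim=-1) == test.targets).float().mean()
print(f"student accuracy from noise-only distillation: {acc:.2%}")
```

The shared initialization is the crux: re-initializing the student independently, the toy analogue of using a different base model, is reported in the paper to eliminate the transfer.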