This paper addresses safety issues that arise when language models generalize, in particular the phenomenon of "emergent misalignment": the problem of models generating harmful responses in deployment settings that lie outside the training distribution. Extending the work of Betley et al., we show that emergent misalignment arises in a variety of settings, including reinforcement learning, fine-tuning on various synthetic datasets, and models without safety training. Through model-diffing analysis with sparse autoencoders, we identify "misaligned persona" features as the cause of emergent misalignment, in particular a "toxic persona" feature that most strongly modulates the harmful responses, and we show that a model's misaligned behavior can be predicted from these "misaligned persona" features. Finally, we propose a mitigation strategy that effectively addresses the misalignment by fine-tuning on a small amount of benign data.
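
As a rough illustration of how such a persona feature could serve both as a misalignment predictor and as a steering direction, the sketch below encodes residual-stream activations with a toy sparse autoencoder, reads off the activation of a hypothetical "toxic persona" latent as a score, and subtracts that latent's decoder direction from the residual stream. The dimensions, weights, and latent index are all placeholder assumptions, not the paper's actual artifacts.

```python
import torch

# Sketch: SAE "persona" latent as a misalignment probe and steering direction.
# All parameters below are random placeholders; a real analysis would load a
# trained SAE and residual-stream activations from the model under study.

D_MODEL, D_SAE = 768, 16384        # assumed model / SAE dimensions
TOXIC_PERSONA_IDX = 1234           # hypothetical index of the "toxic persona" latent

# Placeholder SAE parameters (in practice: trained and loaded from disk).
W_enc = torch.randn(D_MODEL, D_SAE) * 0.01
b_enc = torch.zeros(D_SAE)
W_dec = torch.randn(D_SAE, D_MODEL) * 0.01


def sae_encode(resid: torch.Tensor) -> torch.Tensor:
    """Encode residual-stream activations into sparse latent activations."""
    return torch.relu(resid @ W_enc + b_enc)


def misalignment_score(resid: torch.Tensor) -> float:
    """Mean activation of the 'toxic persona' latent over a batch of token
    positions; a higher value is read as a predictor of misaligned behavior."""
    latents = sae_encode(resid)                  # shape: (tokens, D_SAE)
    return latents[:, TOXIC_PERSONA_IDX].mean().item()


def steer_away(resid: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Negative steering: subtract the latent's (normalized) decoder direction
    from the residual stream to suppress the associated behavior."""
    direction = W_dec[TOXIC_PERSONA_IDX]         # shape: (D_MODEL,)
    direction = direction / direction.norm()
    return resid - alpha * direction


# Random activations stand in for a model's residual stream, so the printed
# numbers are meaningless; the point is the shape of the procedure.
resid = torch.randn(32, D_MODEL)                 # 32 token positions
print(f"toxic-persona score: {misalignment_score(resid):.3f}")
print(f"after steering:      {misalignment_score(steer_away(resid)):.3f}")
```

With a trained SAE, the same score could be computed over a model's responses to flag emergent misalignment before deployment, and the steering direction could be added (rather than subtracted) to probe whether the latent causally elicits the misaligned persona.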