Daily Arxiv

This page curates AI-related papers published around the world.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence

Created by
  • Haebom

Author

Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, Shichang Zhang

Outline

This paper provides a mechanistic analysis of how post-training, the process that turns a pre-trained large language model (LLM) into a more useful and aligned model, reshapes the LLM's internals. The authors compare base models with their post-trained counterparts across model families and datasets from four perspectives: where factual knowledge is stored, how knowledge is represented, how truthfulness and refusal are represented, and how confident the models are. They find, first, that post-training adapts the base model's knowledge representations and develops new ones without relocating where factual knowledge is stored. Second, truthfulness and refusal can each be represented as a direction in the hidden representation space; the truthfulness direction is highly similar between base and post-trained models and transfers effectively for interventions. Third, the refusal direction differs between base and post-trained models and shows limited transferability. Fourth, the confidence differences between base and post-trained models cannot be attributed to entropy neurons. The study offers insight into which mechanisms are preserved and which are changed during post-training, supports downstream work such as model steering, and can inform future research on interpretability and LLM post-training.
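The linear-direction finding lends itself to a small illustration. Below is a minimal sketch, not the authors' code, of one common way such a truthfulness direction can be extracted: as the difference of mean hidden states between true and false statements. The model name, layer index, and example statements are all illustrative assumptions.

```python
# Minimal sketch of extracting a "truthfulness direction" as a difference of
# mean hidden states. Not the paper's code; model, layer, and statements are
# illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper compares several model families
LAYER = 6            # hypothetical probe layer; in practice swept over layers

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def mean_last_token_state(texts, layer=LAYER):
    """Average hidden state of the final token at `layer` over a set of texts."""
    states = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        # out.hidden_states is a tuple of (num_layers + 1) tensors,
        # each of shape (batch, seq_len, d_model).
        states.append(out.hidden_states[layer][0, -1])
    return torch.stack(states).mean(dim=0)

true_stmts = ["The capital of France is Paris.",
              "Water freezes at 0 degrees Celsius."]
false_stmts = ["The capital of France is Berlin.",
               "Water freezes at 50 degrees Celsius."]

# Difference of class means, normalized to unit length.
direction = mean_last_token_state(true_stmts) - mean_last_token_state(false_stmts)
direction = direction / direction.norm()
```

A direction extracted this way from the base model can then be tested on the post-trained model (and vice versa) to probe the kind of transferability the paper reports.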

Takeaways, Limitations

Takeaways:
Deeper understanding of the underlying mechanisms of post-training
Support for follow-up work such as model steering (see the sketch after this list)
New directions for LLM interpretability and post-training research
Identification of how knowledge representations change during post-training
Analysis of truthfulness and refusal as directions in representation space, including their transferability
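As a companion to the extraction sketch above, here is one hedged way such a direction could be used for steering: adding a scaled copy of it to the residual stream through a forward hook. It reuses `model`, `tokenizer`, `direction`, and `LAYER` from the previous sketch; the hook path is GPT-2 specific, and the strength `ALPHA` is an arbitrary assumption, not a value from the paper.

```python
# Hedged steering sketch, continuing from the extraction example above.
# The hook path (model.transformer.h) is GPT-2 specific; ALPHA is arbitrary.
ALPHA = 4.0  # hypothetical steering strength

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # shift them along the extracted direction and pass the rest through.
    hidden = output[0] + ALPHA * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    prompt = tokenizer("The capital of France is", return_tensors="pt")
    generated = model.generate(**prompt, max_new_tokens=8)
    print(tokenizer.decode(generated[0]))
finally:
    handle.remove()  # detach the hook so later forward passes are unmodified
```

Per the paper's findings, an intervention like this should transfer between base and post-trained models for the truthfulness direction, but less so for the refusal direction.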
Limitations:
Because the analysis covers specific model families and datasets, further research is needed to establish generalizability.
Confidence differences may arise from factors other than entropy neurons, which warrants further investigation.
The limited transferability of the refusal direction calls for further analysis and mitigation.