Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

Created by
  • Haebom

Authors

Xi Chen, Aske Plaat, Niki van Stein

Outline

Chain-of-Thought (CoT) prompting reliably improves the accuracy of large language models (LLMs) on multi-step tasks, yet it remains unclear whether the generated "thought process" reflects the model's actual internal reasoning. This paper presents the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders (SAEs) with activation patching, the authors extract monosemantic features while Pythia-70M and Pythia-2.8B solve GSM8K math problems under CoT and plain (noCoT) prompting. Patching a small set of CoT reasoning features into the noCoT run significantly increases the log-probability of the correct answer in the 2.8B model but has no reliable effect in the 70M model, revealing a clear scaling threshold. CoT also significantly increases activation sparsity and feature interpretability scores in the larger model, indicating more modular internal computation; for example, the confidence score for the model generating correct answers improves from 1.2 to 4.3. By introducing patching curves and random-feature patching baselines, the authors show that useful CoT information is not confined to the top-K features but is broadly distributed. Overall, the results indicate that CoT elicits a more interpretable internal structure in high-capacity LLMs, validating its role as a structured prompting method.
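
To make the method concrete, below is a minimal sketch of feature-level activation patching through a sparse autoencoder. It assumes a trained SAE over a single residual-stream layer and cached activations from paired CoT/noCoT runs aligned at the answer position; the class and function names here are illustrative, not taken from the paper's code.

```python
import torch

class SAE:
    # Toy SAE container for illustration only; a real SAE would be trained
    # to reconstruct residual-stream activations under a sparsity penalty.
    def __init__(self, d_model, n_feats):
        self.W_enc = torch.randn(d_model, n_feats) / d_model ** 0.5
        self.b_enc = torch.zeros(n_feats)
        self.W_dec = torch.randn(n_feats, d_model) / n_feats ** 0.5
        self.b_dec = torch.zeros(d_model)

def sae_encode(sae, acts):
    # ReLU encoder: sparse, non-negative feature coefficients per position.
    return torch.relu((acts - sae.b_dec) @ sae.W_enc + sae.b_enc)

def sae_decode(sae, feats):
    # Linear decoder: map feature coefficients back to the residual stream.
    return feats @ sae.W_dec + sae.b_dec

def patch_cot_features(sae, acts_nocot, acts_cot, feature_ids):
    """Causal intervention: overwrite the chosen SAE features in the noCoT
    activations with their values from the matched CoT run, keep every
    other feature fixed, and decode back to the residual stream."""
    f_nocot = sae_encode(sae, acts_nocot)
    f_cot = sae_encode(sae, acts_cot)
    f_patched = f_nocot.clone()
    f_patched[..., feature_ids] = f_cot[..., feature_ids]
    return sae_decode(sae, f_patched)
```

The patched activations would then be injected back into the model (e.g., via a forward hook) to measure how much the answer's log-probability changes.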
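
The patching-curve analysis can be sketched in the same style: sweep the number of patched features K and compare the top-K CoT features, ranked here by absolute CoT-vs-noCoT activation difference, against random feature sets of the same size. `score_fn` is a placeholder for a full forward pass that injects the patched activations and returns the answer log-probability; this is an assumption-laden sketch, not the authors' implementation.

```python
def patch_curve(sae, f_nocot, f_cot, ks, score_fn, n_random=5, seed=0):
    """For each K in ks, patch the top-K ranked features and n_random
    random feature sets of size K into the noCoT run, scoring each
    patched reconstruction with score_fn (e.g., answer log-probability).
    Assumes feature tensors of shape [..., n_feats] with at least one
    leading batch/position dimension."""
    gen = torch.Generator().manual_seed(seed)
    n_feats = f_nocot.shape[-1]
    # Rank features by how differently they activate under CoT vs noCoT,
    # averaged over all leading dimensions.
    diff = (f_cot - f_nocot).abs()
    rank = diff.mean(dim=tuple(range(diff.dim() - 1))).argsort(descending=True)

    def patched_score(ids):
        f = f_nocot.clone()
        f[..., ids] = f_cot[..., ids]
        return score_fn(sae_decode(sae, f))

    curve = []
    for k in ks:
        top = patched_score(rank[:k])
        rand = sum(patched_score(torch.randperm(n_feats, generator=gen)[:k])
                   for _ in range(n_random)) / n_random
        curve.append((k, top, rand))
    return curve
```

A small gap between the top-K curve and the random baseline as K grows would indicate that useful CoT information is spread across many features rather than concentrated in a few, consistent with the finding summarized above.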

Takeaways, Limitations

Takeaways:
We experimentally demonstrate that CoT prompting leads to a more interpretable and modular internal structure in high-capacity LLMs.
We show that the effectiveness of CoT varies sharply with model size; there is a clear scaling threshold.
We verify that useful CoT information is broadly distributed rather than concentrated in the top-ranked features.
We observe that CoT improves the confidence score for the model generating correct answers (from 1.2 to 4.3).
Limitations:
The study is limited to specific models (Pythia-70M and Pythia-2.8B) and a single task (GSM8K math problems).
A complete understanding of the internal workings of CoT is still lacking.
Generalizability to other LLM families and other tasks requires further research.