Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

MIB: A Mechanistic Interpretability Benchmark

Created by
  • Haebom

Author

Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Ivan Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov

Outline

This paper proposes the Mechanistic Interpretability Benchmark (MIB) to provide a reliable evaluation standard for mechanistic interpretability methods. MIB consists of two tracks, circuit localization and causal variable localization, spanning four tasks and five models. The circuit localization track compares methods for finding the model components, and the connections between them, that are most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods for featurizing hidden vectors (e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS)) and aligning those features with task-relevant causal variables. Experiments show that attribution and mask-optimization methods perform best on circuit localization, that the supervised DAS method performs best on causal variable localization, and that SAE features do not outperform neurons (unfeaturized hidden vectors). In conclusion, MIB enables meaningful comparisons and thereby increases confidence in real progress in the field.
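To make the circuit localization idea concrete, here is a minimal toy sketch of attribution patching, one of the methods the benchmark compares. This is a hypothetical illustration, not the MIB implementation: the toy linear model, its weights, and the readout metric are all invented for the example. The core idea is a first-order Taylor approximation, effect_i ≈ (a_clean_i − a_corrupt_i) · ∂metric/∂a_i, so a single pass scores every component at once instead of patching each one separately.

```python
# Toy sketch of attribution patching (hypothetical model and metric).
# In a real transformer the gradient would come from autograd on the
# corrupted run; here the metric is linear, so the gradient is exact.

def hidden_acts(x, W):
    """Forward pass of a toy one-layer linear model: h_i = sum_j W[i][j] * x[j]."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def metric(h, readout):
    """Scalar task metric (e.g. a logit difference): dot(readout, h)."""
    return sum(r * h_i for r, h_i in zip(readout, h))

def attribution_scores(x_clean, x_corrupt, W, readout):
    """Approximate the effect of patching each hidden unit from the
    corrupted run back to its clean value, via (clean - corrupt) * grad."""
    h_clean = hidden_acts(x_clean, W)
    h_corrupt = hidden_acts(x_corrupt, W)
    # For this linear metric, d(metric)/d(h_i) = readout[i] exactly.
    grad = readout
    return [(hc - hk) * g for hc, hk, g in zip(h_clean, h_corrupt, grad)]

W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
readout = [3.0, 0.5, 1.0]
scores = attribution_scores([1.0, 0.0], [0.0, 1.0], W, readout)
# Components with large |score| are candidate circuit members.
```

Ranking components by |score| and keeping the top-k yields a candidate circuit, which a benchmark like MIB can then evaluate by, for example, measuring how faithfully the pruned circuit reproduces the full model's behavior.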

Takeaways, Limitations

Takeaways:
MIB provides objective criteria for evaluating whether a mechanistic interpretability method constitutes a substantial improvement.
We comprehensively evaluate two important aspects: circuit localization and causal variable localization.
Comparative analysis of various methods clearly reveals the strengths and weaknesses of each.
It increases the reliability of developments in the field and suggests future research directions.
Limitations:
Currently, MIB is limited to only four tasks and five models, requiring further validation of its generalizability.
Further discussion may be needed regarding the objectivity and fairness of the evaluation criteria.
Extensions to more diverse models and tasks are needed.
It may be biased towards certain types of models or tasks.