Daily Arxiv

This page collects papers on artificial intelligence published worldwide.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models

Created by
  • Haebom

Authors

Mariam Mahran, Katharina Simbeck

Outline

To improve the interpretability of large language models (LLMs), we apply sparse autoencoders (SAEs) to a GPT-style transformer model trained on Jane Austen's novels, analyzing the structure, themes, and biases in both the model's representations and the training data. We find interpretable features reflecting core narrative concepts such as gender, class, and social obligation.
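The paper's exact SAE architecture and hyperparameters are not given in this summary; the sketch below shows the general recipe, assuming PyTorch, with illustrative values for d_model, d_hidden, and l1_coeff. The idea is to reconstruct a transformer's internal activations through an overcomplete hidden layer whose activations are pushed toward sparsity, so that individual hidden units tend to align with interpretable features.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over transformer activations.

    d_model: width of the activations being analyzed (illustrative).
    d_hidden: size of the overcomplete feature dictionary, typically
              several times d_model (illustrative).
    """
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        # ReLU keeps feature activations non-negative; the L1 penalty
        # in the training loss drives most of them to zero (sparsity).
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features

# One training step: reconstruct the captured activations while
# penalizing the L1 norm of the features to enforce sparsity.
sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # sparsity strength (hypothetical value)

activations = torch.randn(64, 768)  # stand-in for activations captured from the model
reconstruction, features = sae(activations)
loss = ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()
loss.backward()
optimizer.step()
```

After training, features that fire consistently on related passages (e.g., scenes involving marriage or inheritance) can be inspected to surface the thematic and bias-related structure the paper describes.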

Takeaways, Limitations

  • Combining LLMs with SAEs enables scalable exploration of complex datasets.
  • We present a novel method for detecting bias in training data and improving model interpretability.
  • The study is limited to the specific domain of Jane Austen's novels; generalization to other datasets requires further work.
  • The complexity of training and interpreting SAEs can make practical application difficult.