Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology

Created by
  • Haebom

Author

Aideen Fay, Inés García-Redondo, Qiquan Wang, Haim Dubossarsky, Anthea Monod

Outline

This study applies persistent homology (PH), a tool from topological data analysis, to characterize how adversarial inputs affect the internal representation spaces of large language models (LLMs). Whereas existing interpretability methods focus on linear directions or isolated features, this approach captures the high-dimensional, nonlinear relational geometry of the latent space. Analyzing six state-of-the-art models under two adversarial settings, indirect prompt injection and backdoor fine-tuning, the study identifies consistent topological signatures of adversarial influence: adversarial inputs induce a "topological compression" of the latent space, simplifying its structure.
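To make the kind of measurement concrete, below is a minimal sketch of comparing the persistent homology of benign versus adversarial hidden-state point clouds. The `ripser` package and the total-persistence statistic are illustrative assumptions, and `extract_hidden_states` is a hypothetical placeholder; the paper's exact pipeline is not detailed in this summary.

```python
# Minimal sketch (not the authors' exact pipeline): summarize the persistence
# diagram of one layer's hidden-state point cloud, then compare benign vs.
# adversarial inputs. Assumes the `ripser` package (pip install ripser);
# total persistence is one common PH summary statistic.
import numpy as np
from ripser import ripser

def total_persistence(points: np.ndarray, maxdim: int = 1) -> float:
    """Sum of finite bar lengths (death - birth) in dimensions 0..maxdim."""
    diagrams = ripser(points, maxdim=maxdim)["dgms"]
    total = 0.0
    for dgm in diagrams:
        finite = dgm[np.isfinite(dgm[:, 1])]  # drop the infinite H0 bar
        total += float(np.sum(finite[:, 1] - finite[:, 0]))
    return total

# Hypothetical usage: each row is one token's activation at a given layer.
# `extract_hidden_states` stands in for your own model hook.
# benign = extract_hidden_states(model, clean_prompts, layer=20)
# adv    = extract_hidden_states(model, injected_prompts, layer=20)
# "Topological compression" would appear as total_persistence(adv)
# being markedly lower than total_persistence(benign).
```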

Takeaways, Limitations

Takeaways:
Presents a novel PH-based framework for understanding adversarial influence on LLMs.
Identifies a consistent signature of adversarial influence, "topological compression", across a variety of architectures and model sizes.
The signature is statistically robust, discriminative across layers, and yields interpretable insights into how adversarial effects emerge and propagate.
Complements existing interpretability methods by revealing fundamental invariants of representational change in LLMs.
Limitations:
Specific limitations are not stated in the abstract. (Please refer to the original text.)