
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Attend or Perish: Benchmarking Attention in Algorithmic Reasoning

Created by
  • Haebom

Author

Michal Spiegel, Michal Štefánik, Marek Kadlčík, Josef Kuchař

Outline

In this paper, we propose AttentionSpan, a new benchmark for evaluating how reliably pre-trained language models perform algorithmic tasks, in particular whether they maintain performance on previously unseen input/output domains. AttentionSpan consists of five tasks with infinite input domains, designed to distinguish algorithmic understanding from memorization. This makes it possible to evaluate a model's ability to generalize to unseen input types, including novel lengths, value ranges, or input domains, as well as the robustness of the learned mechanisms. Through attention-map analysis and targeted interventions, we show that the attention mechanism is directly responsible for generalization failure. Implementations of all tasks and interpretability methods are publicly available.
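As a rough illustration of how a task with an unbounded input domain can separate algorithmic understanding from memorization, the sketch below builds a toy sequence-copying dataset in which training sequences are strictly shorter than test sequences. The task choice, length ranges, and function names are illustrative assumptions, not taken from the paper.

```python
import random

VOCAB = list("abcdefghij")

def make_copy_example(length: int) -> tuple[str, str]:
    """Generate one input/target pair for a toy copy task."""
    seq = "".join(random.choice(VOCAB) for _ in range(length))
    return seq, seq  # the model must reproduce the input verbatim

def make_split(lengths: range, n_per_length: int) -> list[tuple[str, str]]:
    """Build a dataset covering every length in the given range."""
    return [make_copy_example(n) for n in lengths for _ in range(n_per_length)]

# Train only on short sequences and evaluate on strictly longer ones,
# so that success on the test split requires applying the copying
# algorithm rather than recalling memorized input/output pairs.
train_set = make_split(range(1, 11), n_per_length=100)   # lengths 1-10
test_set  = make_split(range(20, 41), n_per_length=20)   # lengths 20-40
```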

Takeaways, Limitations

Takeaways:
Provides rigorous evaluation criteria for the algorithmic reasoning capabilities of pre-trained language models.
We present a novel methodology for evaluating the generalization ability and robustness of models.
We identify limitations of the attention mechanism and suggest directions for model improvement.
The open-source code improves the reproducibility and extensibility of the research.
Limitations:
The AttentionSpan benchmark consists of only five tasks, which may not cover all aspects of algorithmic reasoning.
Since the presented methodology focuses on the attention mechanism, it may underanalyze the role of other model components (an illustrative attention-level intervention is sketched after this list).
Tasks constructed over infinite input domains may differ from real-world algorithmic problems.
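The sketch below illustrates one way such a targeted attention intervention could be probed in practice, using the head_mask argument of a Hugging Face GPT-2 model to ablate a single attention head and compare next-token predictions. The model, prompt, and chosen (layer, head) pair are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of a targeted attention intervention: mask out one head
# and observe how the next-token prediction changes.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "1 2 3 4 5 6 7"          # placeholder algorithmic-style input
inputs = tokenizer(prompt, return_tensors="pt")

# head_mask: 1.0 keeps a head, 0.0 ablates it (shape: layers x heads).
head_mask = torch.ones(model.config.n_layer, model.config.n_head)
head_mask[5, 3] = 0.0             # hypothetical head chosen for ablation

with torch.no_grad():
    base = model(**inputs).logits[0, -1]
    ablated = model(**inputs, head_mask=head_mask).logits[0, -1]

print("top token (full model):   ", tokenizer.decode([base.argmax().item()]))
print("top token (head ablated): ", tokenizer.decode([ablated.argmax().item()]))
```

If ablating a single head noticeably degrades predictions only on out-of-distribution lengths, that is the kind of evidence the paper uses to attribute generalization failure to the attention mechanism.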