Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time

Created by
  • Haebom

Author

Huihan Li, You Chen, Siyuan Wang, Yixin He, Ninareh Mehrabi, Rahul Gupta, Xiang Ren

Outline

This paper addresses the phenomenon where large language models (LLMs) perform well on reasoning benchmarks yet often fail when the input is only slightly altered. Specifically, it highlights that spurious memorization patterns in Chain-of-Thought (CoT) reasoning can introduce intermediate errors that propagate to incorrect final answers. To address this, the authors present STIM, a novel framework for identifying the source of memorization: each token in the reasoning trace is attributed to one of several memorization sources (local, mid-range, or long-range) based on statistical co-occurrence with the pretraining corpus. Token-level analysis across various tasks and distributional settings reveals that models rely more heavily on memorization in complex or long-tailed tasks, with local memorization being the primary driver of errors (up to 67% of incorrect tokens). The authors also show that STIM's memorization scores can predict which tokens are wrong in erroneous reasoning steps. STIM is thus a powerful tool for diagnosing and improving model reasoning and generalizes to other structured step-by-step generation tasks.
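The attribution idea above can be sketched in a few lines. The sketch below is a loose illustration, not the paper's implementation: the function names, the n-gram-based conditional-probability score, and the three-way local/mid-range/long-range split are assumptions for demonstration purposes only.

```python
# Hedged sketch of STIM-style token attribution. The exact scoring rule in the
# paper may differ; here a token's memorization score for a given context is
# estimated as P(token | prefix) from hypothetical pretraining n-gram counts.

def memorization_score(prefix, token, ngram_counts, prefix_counts):
    """Estimate P(token | prefix) from corpus co-occurrence counts.

    prefix: tuple of context tokens (e.g. the immediately preceding token
            for "local", a wider window for "mid-range" or "long-range").
    """
    denom = prefix_counts.get(prefix, 0)
    if denom == 0:
        return 0.0
    return ngram_counts.get((prefix, token), 0) / denom


def dominant_source(scores):
    """Return the memorization source (e.g. 'local') with the highest score."""
    return max(scores, key=scores.get)


# Toy counts standing in for pretraining-corpus statistics.
ngram_counts = {(("the",), "cat"): 3}
prefix_counts = {("the",): 4}

local = memorization_score(("the",), "cat", ngram_counts, prefix_counts)
sources = {"local": local, "mid-range": 0.2, "long-range": 0.1}
print(dominant_source(sources))  # the source the token is attributed to
```

In the paper's analysis, a high score from one source (here, local) marks the token as likely reproduced from memorized co-occurrences rather than derived by reasoning, which is what makes these scores useful for flagging error-prone tokens.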

Takeaways, Limitations

Takeaways:
A new framework (STIM) is presented for analyzing the causes of LLM reasoning errors at the token level.
LLMs rely more heavily on memorization on complex or long-tailed tasks, and local memorization is revealed as the main source of errors.
STIM's memorization scores can be used to predict incorrect tokens in erroneous reasoning steps.
The framework also applies to other structured step-by-step generation tasks.
Limitations:
STIM's evaluation may be limited to the specific benchmarks and datasets used.
Further research may be needed on how memorization is defined and measured.
Not all types of reasoning errors may be fully captured.