This paper addresses the phenomenon in which large language models (LLMs) perform well on reasoning benchmarks yet often fail when the input is only slightly altered. In particular, we highlight that memorization of flawed patterns during Chain-of-Thought (CoT) reasoning can introduce errors at intermediate steps and thereby lead to incorrect final answers. To address this, we present STIM, a novel framework for identifying the source of memorization: each token in the reasoning process is attributed to one of several memorization sources (local, mid-range, or long-range) based on its statistical co-occurrence with the pretraining corpus. Token-level analysis across tasks and distributional settings reveals that models rely more heavily on memorization in complex or long-tailed tasks, and that local memorization is the dominant source of errors, accounting for up to 67% of incorrect tokens. We further show that STIM's memorization scores can predict which tokens are incorrect within erroneous reasoning steps. STIM thus offers a powerful tool for diagnosing and improving model reasoning, and it generalizes to other structured, step-by-step generation tasks.
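
To make the attribution idea concrete, the sketch below scores a single generated token against three context windows using a simple document-level co-occurrence (PMI-style) statistic over a toy corpus. It is an illustrative assumption throughout, not the paper's actual STIM computation: the window definitions (recent chain tokens, the input question, task-level keywords), the toy corpus, and the helper names are placeholders.

```python
# Minimal sketch: score how strongly a generated token co-occurs, in a toy
# "pretraining" corpus, with words from several context windows. All names,
# window choices, and the PMI statistic are illustrative assumptions.
import math
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def build_counts(corpus: Iterable[List[str]]) -> Tuple[Counter, Counter, int]:
    """Count unigrams and within-document word pairs over the toy corpus."""
    unigrams, pairs, n_docs = Counter(), Counter(), 0
    for doc in corpus:
        n_docs += 1
        vocab = set(doc)
        unigrams.update(vocab)
        pairs.update(frozenset(p) for p in combinations(sorted(vocab), 2))
    return unigrams, pairs, n_docs


def pmi(w1: str, w2: str, unigrams: Counter, pairs: Counter, n_docs: int) -> float:
    """Document-level pointwise mutual information between two words."""
    joint = pairs[frozenset((w1, w2))]
    if joint == 0 or unigrams[w1] == 0 or unigrams[w2] == 0:
        return 0.0
    return math.log(joint * n_docs / (unigrams[w1] * unigrams[w2]))


def memorization_scores(token: str, windows: Dict[str, List[str]],
                        unigrams: Counter, pairs: Counter, n_docs: int) -> Dict[str, float]:
    """For each context window, take the strongest corpus co-occurrence
    between the generated token and any word in that window."""
    return {name: max((pmi(token, w, unigrams, pairs, n_docs) for w in ctx), default=0.0)
            for name, ctx in windows.items()}


if __name__ == "__main__":
    toy_corpus = [
        "two plus two equals four".split(),
        "four is an even number".split(),
        "the capital of france is paris".split(),
    ]
    uni, pr, n = build_counts(toy_corpus)
    # Hypothetical windows for one token generated mid-chain:
    windows = {
        "local": "two plus two equals".split(),   # recent tokens in the chain
        "mid":   "what is two plus two".split(),  # the input question
        "long":  ["number", "even", "math"],      # task-level keywords
    }
    print(memorization_scores("four", windows, uni, pr, n))
```

A real implementation would replace the toy counts with corpus-scale co-occurrence statistics from the model's pretraining data; the per-window maximum used here is simply one plausible way to compare local, mid-range, and long-range sources on a common scale.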