This paper explores how prediction rules are implemented in the forward pass of a language model performing few-shot learning. We study a few-shot learning task whose prediction rule adds a fixed integer $k$ to the input, and propose a novel optimization method that confines the model's few-shot capabilities to a small number of attention heads. We conduct a detailed analysis of individual heads through dimensionality reduction and decomposition, and, using the Llama-3-8B-Instruct model as a case study, reduce the model's mechanism to three attention heads and a six-dimensional subspace. Furthermore, we derive mathematical identities connecting the "aggregate" and "extract" subspaces of these attention heads, enabling us to trace the information flow from individual examples to the final aggregated concepts. This allows us to identify a self-correcting mechanism whereby mistakes learned from early demonstrations are suppressed by later demonstrations.
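To make the task concrete, the sketch below constructs a few-shot prompt whose latent rule is $y = x + k$. The abstract does not specify the prompt format; the arrow notation, the number of demonstrations, and the input range here are illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical sketch of the add-k few-shot task; the "->" format,
# demonstration count, and input range are assumptions for illustration.
import random

def make_prompt(k: int, n_demos: int = 5, lo: int = 0, hi: int = 50,
                seed: int = 0) -> tuple[str, int]:
    """Build a few-shot prompt whose latent rule is y = x + k.

    Returns the prompt string and the correct answer for the final query.
    """
    rng = random.Random(seed)
    xs = [rng.randint(lo, hi) for _ in range(n_demos + 1)]
    lines = [f"{x} -> {x + k}" for x in xs[:-1]]  # demonstrations
    query = xs[-1]
    lines.append(f"{query} ->")                   # final query, answer withheld
    return "\n".join(lines), query + k

prompt, answer = make_prompt(k=3)
print(prompt)   # e.g. "24 -> 27\n3 -> 6\n...\n17 ->"
print(answer)   # the completion a model applying the rule should produce
```

Under this setup, the model must infer $k$ from the demonstrations alone and apply it to the final query, which is the behavior the paper's head-level analysis seeks to localize.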