This paper analyzes the ability of several code generation models to handle long-range dependencies using multi-step key search tasks in context windows of up to 8,000 tokens. By using progressively more difficult tasks, we evaluate model performance in a more fine-grained manner than a simple “needle-finding” test. In particular, we find that many models exhibit performance degradations of up to two orders of magnitude when a function refers to another function defined later in the prompt. We also find that models using sliding window attention struggle to handle references that are more than a single window size away. We show that simple prompt modifications using call-graph information can improve multi-step search performance by up to three orders of magnitude. This analysis highlights the need for deeper consideration of long-context performance beyond single-fact retrieval in documents.
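
To make the task structure concrete, the sketch below is a minimal illustration (not the benchmark implementation used in this paper; the name `make_chain_prompt` and the prompt format are assumptions for illustration) of a multi-step key search prompt in which recovering a hidden key requires following a chain of function calls whose callees are defined later in the prompt, i.e., forward references.

```python
# Minimal sketch of a multi-step key search task with forward references.
# Each function calls the next one in the chain, but the callee is defined
# *after* the caller, so resolving the key requires looking later in the prompt.
import random
import string


def random_name(rng: random.Random, length: int = 8) -> str:
    """Generate a random identifier to serve as a function name."""
    return "f_" + "".join(rng.choices(string.ascii_lowercase, k=length))


def make_chain_prompt(num_hops: int, key: int, seed: int = 0) -> str:
    """Build a prompt whose answer requires following `num_hops` calls."""
    rng = random.Random(seed)
    names = [random_name(rng) for _ in range(num_hops)]

    functions = []
    for i, name in enumerate(names):
        if i + 1 < num_hops:
            body = f"    return {names[i + 1]}()"   # forward reference to a later definition
        else:
            body = f"    return {key}"              # the key sits at the end of the chain
        functions.append(f"def {name}():\n{body}\n")

    question = f"# What value does {names[0]}() return?\n"
    return question + "\n".join(functions)


if __name__ == "__main__":
    # A 3-hop instance: the model must traverse two forward references
    # before reaching the function that returns the key.
    print(make_chain_prompt(num_hops=3, key=42))
```

In this framing, increasing `num_hops` yields the progressively more difficult tasks mentioned above, and prepending a textual call graph (e.g., which function calls which) corresponds to the prompt modification evaluated in the paper.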