Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Evaluating Long Range Dependency Handling in Code Generation LLMs

Created by
  • Haebom

Authors

Yannick Assogba, Donghao Ren

Outline

This paper analyzes how well several code generation models handle long-range dependencies, using multi-step key retrieval tasks in context windows up to 8,000 tokens long. By making the tasks progressively harder, the authors evaluate model performance at a finer grain than a simple needle-in-a-haystack test. In particular, many models suffer performance degradations of up to two orders of magnitude when a function references another function defined later in the prompt, and models that use sliding-window attention struggle with references farther away than a single window size. Simple prompt modifications that exploit call-graph information improve multi-step retrieval performance by up to three orders of magnitude. The analysis highlights the need to consider long-context performance beyond single-fact retrieval from documents.
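To make the task concrete, here is a minimal sketch of how such a multi-step key retrieval prompt could be constructed. The names (`build_chain_prompt`, `get_key_i`) and the chain structure are illustrative assumptions, not the authors' actual benchmark harness.

```python
# Illustrative sketch of a multi-step key retrieval prompt, in the spirit
# of the paper's benchmark; names and structure are assumptions.

import random
import string


def random_key(n: int = 8) -> str:
    """Return a random alphanumeric 'key' the model must retrieve."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=n))


def build_chain_prompt(depth: int, forward_refs: bool = False) -> tuple[str, str]:
    """Build a prompt holding a chain of `depth` functions: get_key_0()
    returns the answer, and each get_key_i() calls get_key_{i-1}().
    With forward_refs=True, callers appear before their callees, the
    ordering the paper reports as hardest for many models."""
    answer = random_key()
    funcs = [f"def get_key_0():\n    return {answer!r}\n"]
    for i in range(1, depth):
        funcs.append(f"def get_key_{i}():\n    return get_key_{i - 1}()\n")
    if forward_refs:
        funcs.reverse()  # definitions now follow their call sites
    question = f"# What string does get_key_{depth - 1}() return?"
    return "\n".join(funcs) + "\n" + question, answer


prompt, answer = build_chain_prompt(depth=4, forward_refs=True)
print(prompt)
print("expected answer:", answer)
```

Padding the prompt with distractor functions between the links of the chain is what pushes the references far enough apart to stress long-range attention.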

Takeaways, Limitations

Takeaways:
Presents a more granular method for evaluating how models handle long-range dependencies.
Clarifies the limitations of long-context handling in code generation models (in particular, inter-function references and sliding-window attention mechanisms).
Suggests that performance can be improved by exploiting call-graph information (see the sketch after this list).
Emphasizes the need for in-depth long-context evaluation beyond simple fact retrieval.
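One plausible instance of a call-graph-aware prompt modification is to topologically sort function definitions so that every callee appears before its callers; the authors' exact modification may differ, and `reorder_by_call_graph` below is a hypothetical helper.

```python
# Hedged sketch: reorder function definitions using the call graph so that
# every callee precedes its callers, removing forward references from the
# prompt. Illustrative only; the paper's exact modification may differ.

from graphlib import TopologicalSorter  # standard library, Python 3.9+


def reorder_by_call_graph(functions: dict[str, str],
                          calls: dict[str, set[str]]) -> str:
    """`functions` maps a name to its source text; `calls` maps each
    function to the set of functions it calls. Feeding callees to
    TopologicalSorter as predecessors makes static_order() emit them
    first, so every reference in the prompt points backwards."""
    order = TopologicalSorter(calls).static_order()
    return "\n".join(functions[name] for name in order)


functions = {
    "get_key_1": "def get_key_1():\n    return get_key_0()\n",
    "get_key_0": "def get_key_0():\n    return 'k3v9'\n",
}
calls = {"get_key_1": {"get_key_0"}, "get_key_0": set()}
print(reorder_by_call_graph(functions, calls))  # get_key_0 is printed first
```

Reordering only helps when the call graph is acyclic; for recursive or mutually recursive code, an annotation-based hint would be needed instead.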
Limitations:
The type and number of code generation models analyzed are limited.
Further research is needed to determine whether the gains from call-graph information generalize to all cases.
No performance analysis for contexts longer than 8,000 tokens.