Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering

Created by
  • Haebom

Author

Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang

Outline

LoCoBench is a comprehensive benchmark designed to evaluate long-context large language models (LLMs), whose context windows now reach millions of tokens, in realistic and complex software development scenarios. Unlike existing code evaluation benchmarks that focus on single-function completion or short-context tasks, LoCoBench addresses the evaluation gap for the long-context capabilities required to understand entire codebases, reason across multiple files, and maintain architectural consistency in large-scale software systems. It provides 8,000 systematically generated evaluation scenarios across 10 programming languages, with context lengths ranging from 10K to 1M tokens (a 100-fold variation), enabling precise measurement of how performance degrades with context length in realistic development settings. The benchmark introduces eight task categories that capture essential long-context capabilities: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis. A five-stage pipeline generates a diverse, high-quality set of scenarios that require LLMs to reason over complex codebases at unprecedented scale. The authors also introduce a comprehensive evaluation framework with 17 metrics (including eight new evaluation metrics) across four dimensions, aggregated into the LoCoBench Score (LCBS). Evaluations of state-of-the-art long-context models reveal substantial performance gaps, showing that long-context understanding in complex software development remains a significant unsolved challenge. LoCoBench will be released at https://github.com/SalesforceAIResearch/LoCoBench .
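
For illustration, here is a minimal Python sketch of how a benchmark of this shape might be represented: a scenario record varying along language, task category, and context length, plus an LCBS-style aggregate over per-dimension scores. The field names, the placeholder dimension names, and the equal-weight averaging are assumptions made for exposition; they are not taken from the LoCoBench paper or repository.

```python
# Illustrative sketch only. The dataclass fields, the placeholder dimension
# names, and the equal-weight averaging are assumptions for exposition;
# they are not taken from the LoCoBench paper or repository.
from dataclasses import dataclass
from statistics import mean

# The eight task categories described in the summary above.
TASK_CATEGORIES = [
    "architectural_understanding",
    "cross_file_refactoring",
    "multi_session_development",
    "bug_investigation",
    "feature_implementation",
    "code_comprehension",
    "integration_testing",
    "security_analysis",
]

@dataclass
class Scenario:
    """One of the 8,000 evaluation scenarios (hypothetical schema)."""
    scenario_id: str
    language: str        # one of the 10 supported programming languages
    task_category: str   # one of TASK_CATEGORIES
    context_length: int  # roughly 10_000 .. 1_000_000 tokens

def locobench_score(dimension_scores: dict[str, float]) -> float:
    """Toy LCBS-style aggregate: unweighted mean of per-dimension scores
    in [0, 1]. The real benchmark aggregates 17 metrics across four
    dimensions; the actual weighting is defined by the benchmark itself."""
    return mean(dimension_scores.values())

if __name__ == "__main__":
    scenario = Scenario("demo-0001", "python", "bug_investigation", 250_000)
    scores = {  # placeholder dimension names and values
        "dimension_1": 0.62,
        "dimension_2": 0.48,
        "dimension_3": 0.71,
        "dimension_4": 0.55,
    }
    print(scenario.task_category, round(locobench_score(scores), 3))
```

A weighted aggregate or per-dimension reporting would follow the same shape; the sketch is only meant to show how scenarios vary along language, task type, and context length.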

Takeaways, Limitations

Takeaways:
• Provides a new benchmark for comprehensively evaluating long-context LLMs in realistic software development scenarios.
• Reveals significant unsolved challenges in long-context understanding, pointing to future research directions.
• Covers 10 programming languages and eight task categories, enabling broad evaluation.
• Enables precise analysis of performance degradation across a 100-fold range of context lengths (10K to 1M tokens).
• Introduces new evaluation metrics (and the LoCoBench Score) for more fine-grained assessment.
Limitations:
• The scenario generation process and the design of the evaluation metrics are not described in full detail (more information is needed).
• The set of evaluated models may be limited in type and number (additional model evaluations are needed).
• The benchmark may not capture every aspect of real-world software development (scenario coverage is necessarily limited).