This page curates AI-related papers published worldwide. All content is summarized using Google Gemini and the site is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
LoCoBench is a comprehensive benchmark designed to evaluate long-context large language models (LLMs), with context windows reaching millions of tokens, in realistic and complex software development scenarios. Unlike existing code evaluation benchmarks that focus on single-function completion or short-context tasks, LoCoBench addresses the evaluation gap for long-context capabilities required to understand entire codebases, reason across multiple files, and maintain architectural consistency in large-scale software systems. It provides 8,000 systematically generated evaluation scenarios across 10 programming languages, with context lengths ranging from 10,000 to 1 million tokens, a 100-fold span that enables precise assessment of how performance degrades with context length in realistic software development settings. It introduces eight task categories that exercise long-context capabilities: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis. Through a five-stage pipeline, it generates a diverse, high-quality set of scenarios that require LLMs to reason over complex codebases at unprecedented scale. The benchmark also includes a comprehensive evaluation framework with 17 metrics (eight of them newly introduced) across four dimensions, aggregated into the LoCoBench Score (LCBS). Evaluation of state-of-the-art long-context models reveals substantial performance gaps, highlighting that context understanding in complex software development remains an unmet need. LoCoBench will be released at https://github.com/SalesforceAIResearch/LoCoBench.
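To make the scoring structure concrete, below is a minimal Python sketch of how per-metric results could be rolled up into a single aggregate in the spirit of the LoCoBench Score. The dimension names, example values, and unweighted averaging used here are illustrative assumptions only; the official LCBS definition and the 17 underlying metrics are specified in the paper and repository.

```python
"""Illustrative sketch of aggregating benchmark metrics into one score.

The dimension names, metric values, and the unweighted averaging rule below
are placeholders chosen for illustration; they are NOT the official
LoCoBench Score (LCBS) formula.
"""

from statistics import mean

# Hypothetical per-dimension metric scores, each normalized to [0, 1].
# A real run would populate these from the benchmark's 17 metrics.
dimension_scores = {
    "correctness": [0.62, 0.58, 0.71],
    "code_quality": [0.66, 0.70],
    "architectural_consistency": [0.54, 0.49, 0.57],
    "long_context_utilization": [0.41, 0.38, 0.45],
}


def aggregate_score(scores_by_dimension: dict[str, list[float]]) -> float:
    """Average metrics within each dimension, then average the dimension
    scores (a simple unweighted aggregation chosen for this sketch)."""
    per_dimension = [mean(values) for values in scores_by_dimension.values()]
    return mean(per_dimension)


if __name__ == "__main__":
    print(f"Aggregate score: {aggregate_score(dimension_scores):.3f}")
```

An unweighted two-level average is only one possible design; a weighted scheme could emphasize dimensions that correlate most strongly with long-context degradation, which is the kind of choice the actual LCBS formulation makes explicit.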