Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval

Created by
  • Haebom

Author

Jiahui Geng, Fengyu Cai, Shaobo Cui, Qing Li, Liangwei Chen, Chenyang Lyu, Haonan Li, Derui Zhu, Walter Pretschner, Heinz Koeppl, Fakhri Karray

Outline

This paper proposes CoQuIR, a large-scale, multilingual benchmark for evaluating the quality awareness of code retrieval, which is essential for improving code reuse and debugging speed in software development. Unlike existing benchmarks that focus solely on functional relevance, CoQuIR provides fine-grained quality annotations for 42,725 queries and 134,907 code snippets across 11 programming languages, covering four core dimensions: correctness, efficiency, security, and maintainability. Using two quality-focused evaluation metrics—Pairwise Preference Accuracy and Margin-based Ranking Score—the authors benchmark 23 retrieval models and find that even the best-performing models struggle to distinguish buggy or insecure code from more robust alternatives. They also conduct a preliminary investigation into training methods that explicitly encourage quality awareness, demonstrating improvements on the quality-aware metrics across various models using synthetic datasets, and validate the effectiveness of the approach through downstream code generation experiments. In conclusion, this study highlights the importance of integrating quality signals into code retrieval systems, laying the foundation for more reliable and robust software development tools.
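The summary names two quality-focused metrics but does not spell out their formulas. The sketch below shows one plausible reading of each, assuming pairs of retrieval scores for a higher-quality and a lower-quality snippet that answer the same query; the exact definitions in the paper may differ.

```python
# Illustrative sketch of the two metrics named in the summary.
# These are assumed definitions for intuition, not the paper's
# official implementations.

def pairwise_preference_accuracy(pairs):
    """Fraction of (good, bad) pairs where the retriever scores the
    higher-quality snippet above the lower-quality one.
    `pairs` is a list of (score_good, score_bad) tuples."""
    correct = sum(1 for good, bad in pairs if good > bad)
    return correct / len(pairs)

def margin_based_ranking_score(pairs):
    """Average score margin between the higher- and lower-quality
    snippet; larger positive values suggest stronger quality awareness."""
    return sum(good - bad for good, bad in pairs) / len(pairs)

# Toy retrieval scores for three query/snippet-pair examples
pairs = [(0.82, 0.74), (0.61, 0.66), (0.90, 0.55)]
print(pairwise_preference_accuracy(pairs))  # 2 of 3 pairs ranked correctly
print(margin_based_ranking_score(pairs))
```

Under this reading, the accuracy metric only checks the sign of each comparison, while the margin metric also rewards how confidently the model separates higher-quality from lower-quality code.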

Takeaways, Limitations

Takeaways:
It emphasizes the importance of considering code quality (correctness, efficiency, security, and maintainability) in code retrieval systems.
It provides CoQuIR, a large-scale, multilingual benchmark for accurately evaluating the quality awareness of code retrieval models.
It demonstrates that quality-focused training methods can improve quality-awareness performance.
It lays the foundation for more reliable and robust software development tools.
Limitations:
The quality-aware training experiments are preliminary and rely on synthetic datasets, so further validation on real-world data is needed.
Further research is needed to determine the generalizability of the proposed quality-focused training method.
The paper offers limited discussion of the limitations of the proposed evaluation metrics and how they might be improved.