Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree

Created by
  • Haebom

Authors

Yilin Zhang, Xinran Zhao, Zora Zhiruo Wang, Chenyang Yang, Jiayi Wei, Tongshuang Wu

Outline

We identify chunking, the step that divides documents into retrievable units, as a key problem in Retrieval-Augmented Generation (RAG) for large-scale code generation, and propose a structure-aware chunking method based on Abstract Syntax Trees (ASTs) to address it. The method recursively splits AST nodes and merges sibling nodes while respecting size constraints, producing self-contained chunks that are semantically coherent across languages and tasks. It improves performance across a range of code generation tasks, for example raising Recall@5 by 4.3 points on RepoEval retrieval and Pass@1 by 2.67 points on SWE-bench generation.
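
The split-then-merge idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses Python's built-in `ast` module as the parser (so it handles Python source only), measures chunk size in non-whitespace characters, and works on whole lines. The names `nonws_len`, `ast_chunks`, and `SAMPLE` are hypothetical.

```python
import ast


def nonws_len(text: str) -> int:
    """Size metric (an assumption): number of non-whitespace characters."""
    return sum(not ch.isspace() for ch in text)


def ast_chunks(source: str, max_size: int) -> list[str]:
    """Greedily merge sibling statements into chunks of at most max_size,
    recursing into the children of any statement that is too large."""
    lines = source.splitlines()
    tree = ast.parse(source)

    def text(start: int, end: int) -> str:
        # Lines are 1-indexed and the range is inclusive.
        return "\n".join(lines[start - 1:end])

    def split(stmts, start: int, end: int) -> list[str]:
        chunks = []
        cur_start, cur_end = start, start - 1  # current chunk is empty
        for i, stmt in enumerate(stmts):
            # A statement's span also absorbs the lines before it (comments,
            # decorators, a parent's header, or an `else:` line).
            s_start = cur_end + 1
            s_end = end if i == len(stmts) - 1 else stmt.end_lineno
            if s_end < s_start:  # shares a line with the previous statement
                continue
            if nonws_len(text(cur_start, s_end)) <= max_size:
                cur_end = s_end  # merge this sibling into the current chunk
                continue
            if cur_end >= cur_start:  # flush what has been merged so far
                chunks.append(text(cur_start, cur_end))
            cur_start = s_start
            if nonws_len(text(s_start, s_end)) <= max_size:
                cur_end = s_end  # start a new chunk with this statement
                continue
            # The statement alone is too large: recurse into its children.
            children = [c for c in ast.iter_child_nodes(stmt)
                        if isinstance(c, (ast.stmt, ast.ExceptHandler))]
            if children:
                chunks.extend(split(children, s_start, s_end))
            else:
                chunks.append(text(s_start, s_end))  # oversized leaf, kept whole
            cur_start, cur_end = s_end + 1, s_end
        if cur_end >= cur_start:
            chunks.append(text(cur_start, cur_end))
        return chunks

    return split(tree.body, 1, len(lines))


SAMPLE = '''import os

def add(a, b):
    return a + b

def sub(a, b):
    return a - b

class Calc:
    def mul(self, a, b):
        return a * b

    def div(self, a, b):
        return a / b
'''

chunks = ast_chunks(SAMPLE, max_size=40)
```

With a limit of 40, the sketch keeps `import os` together with `add`, places `sub` in its own chunk, and splits the oversized `Calc` class at method boundaries rather than at an arbitrary line. Because the chunks cover contiguous line ranges, joining them with newlines reproduces the original text. A multi-language version would need a different parser; that choice is outside what the summary states.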

Takeaways, Limitations

We highlight the importance of structure-aware chunking and its potential to improve RAG-based code generation pipelines.
AST-based chunking produces semantically consistent code fragments, which improves performance across a variety of code generation tasks.
We demonstrate the effectiveness of the method with concrete gains on RepoEval retrieval and SWE-bench generation.
The paper does not state specific limitations.
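
For readers unfamiliar with the two metrics cited in the summary, the sketch below shows their common definitions. It assumes Recall@k is the fraction of relevant items found among the top k retrieved results, and that Pass@1 with one generated sample per task is the fraction of tasks whose sample passes. The paper's exact evaluation protocol may differ, and the function names and sample data are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: list[str], k: int = 5) -> float:
    """Fraction of relevant items that appear in the top-k retrieved results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = relevant_set & set(retrieved[:k])
    return len(hits) / len(relevant_set)


def pass_at_1(passed: list[bool]) -> float:
    """Fraction of tasks solved, assuming one generated sample per task."""
    return sum(passed) / len(passed) if passed else 0.0


# Hypothetical data: chunk IDs ranked by a retriever, and per-task test outcomes.
ranked = ["a", "b", "c", "d", "e", "f"]
gold = ["b", "f", "z"]
recall = recall_at_k(ranked, gold, k=5)   # only "b" is in the top 5
solved = pass_at_1([True, False, True, True])
```

Both values are fractions between 0 and 1. Gains reported in "points", such as the 4.3 and 2.67 above, are normally differences between these fractions expressed as percentages.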