Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the authors and their institutions. When sharing, please cite the source.

Generating High-Quality Datasets for Code Editing via Open-Source Language Models

Created by
  • Haebom

Author

Zekai Zhang, Mingwei Liu, Zhenxi Chen, Linxi Liang, Yuxuan Chen, Guangsheng Ou, Yanlin Wang, Dan Li, Xin Peng, Zibin Zheng

Outline

OpenCodeEdit is an open-source pipeline that synthesizes realistic code-editing triplets by leveraging multiple LLMs for code editing, a crucial task in software engineering. The pipeline generates both concise "lazy" instructions and more detailed "descriptive" instructions, and applies diff-based and topic-based filtering to ensure data quality and diversity. This process produced OCEDataFT, a curated dataset of 20,000 samples. Fine-tuning three strong baseline models on OCEDataFT significantly improved performance on the CanItEdit benchmark, with relative pass@1 improvements ranging from 4.50% to 20.79%. Notably, the fine-tuned models approach the performance of closed-source systems, narrowing the gap with GPT-4 to 3.54%, without requiring proprietary resources or manual annotation.
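The summary above describes code-editing "triplets" (original code, an instruction, edited code) and a diff-based filter. A minimal sketch of what such a triplet and filter might look like is below; the class, function names, and thresholds are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of a code-editing triplet and a diff-based quality
# filter, loosely modeled on the pipeline described above. Names and
# thresholds are illustrative assumptions, not from the paper.
import difflib
from dataclasses import dataclass

@dataclass
class EditTriplet:
    original_code: str
    instruction: str   # either a terse "lazy" or detailed "descriptive" instruction
    edited_code: str

def diff_filter(t: EditTriplet, min_changed: int = 1, max_changed: int = 50) -> bool:
    """Keep triplets whose edit is non-trivial but not a wholesale rewrite."""
    diff = difflib.unified_diff(
        t.original_code.splitlines(),
        t.edited_code.splitlines(),
        lineterm="",
    )
    # Count added/removed lines, excluding the "---"/"+++" file headers.
    changed = sum(
        1 for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )
    return min_changed <= changed <= max_changed

# Example: a "lazy" instruction with a small, genuine edit passes the filter.
lazy = EditTriplet(
    original_code="def add(a, b):\n    return a - b\n",
    instruction="fix the bug",
    edited_code="def add(a, b):\n    return a + b\n",
)
print(diff_filter(lazy))  # True: exactly two changed lines
```

A filter like this discards no-op samples (identical code on both sides) and near-total rewrites, which is one plausible way "diff-based filtering" could enforce data quality.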

Takeaways, Limitations

Takeaways:
Generating realistic code-editing instructions through an open-source pipeline improves benchmark performance.
Achieving performance close to GPT-4 without proprietary resources or manual annotation demonstrates the competitiveness of open-source models.
Creating both concise "lazy" and detailed "descriptive" instructions covers a wide range of editing scenarios.
Limitations:
Further analysis is needed of how effectively the filtering methods ensure data quality and diversity.
The models' generalization ability and applicability to diverse code-editing tasks require further verification.
Further research is needed to identify which specific factors (e.g., particular LLMs, fine-tuning strategies) drive the performance improvement.