This page organizes papers related to artificial intelligence published around the world. This page is summarized using Google Gemini and is operated on a non-profit basis. The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.
OpenCodeEdit is an open-source pipeline that uses multiple LLMs to synthesize realistic code editing triplets for code editing, a crucial task in software engineering. The pipeline generates both concise "lazy" instructions and more detailed "descriptive" instructions, and applies diff-based and topic-based filtering to ensure data quality and diversity. The result is OCEDataFT, a curated dataset of 20,000 samples. Fine-tuning three advanced base models on OCEDataFT significantly improved performance on the CanItEdit benchmark, with relative pass@1 improvements ranging from 4.50% to 20.79%. Notably, the resulting models achieved performance approaching that of closed-source systems, narrowing the gap to GPT-4 to 3.54% without requiring proprietary resources or manual annotation.
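To make the idea of an editing triplet and a diff-based filter concrete, here is a minimal Python sketch. It is not the authors' implementation: the field names (`before`, `instruction`, `after`), the helper names, and the thresholds `min_changed` and `max_changed` are all assumptions chosen for illustration, and the paper's actual filtering criteria may differ.

```python
import difflib
from dataclasses import dataclass


@dataclass
class EditTriplet:
    """Hypothetical sample layout: original code, instruction, edited code."""
    before: str
    instruction: str
    after: str


def changed_line_count(before: str, after: str) -> int:
    """Count lines that difflib reports as added or removed."""
    delta = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff marks removed lines with "- " and added lines with "+ ";
    # "? " hint lines and unchanged lines are ignored.
    return sum(1 for line in delta if line.startswith(("+ ", "- ")))


def keep_triplet(t: EditTriplet, min_changed: int = 1, max_changed: int = 50) -> bool:
    """Assumed filter: drop no-op edits and edits that rewrite too much."""
    return min_changed <= changed_line_count(t.before, t.after) <= max_changed


# A "lazy" instruction is terse; a "descriptive" one spells out the change.
lazy = EditTriplet(
    before="def add(a, b):\n    return a - b\n",
    instruction="fix add",
    after="def add(a, b):\n    return a + b\n",
)
no_op = EditTriplet(before="x = 1\n", instruction="rename x", after="x = 1\n")

kept = [t for t in (lazy, no_op) if keep_triplet(t)]
```

In this sketch the no-op sample is discarded because its diff is empty, while the one-line fix is kept. A topic-based filter would be a separate step and is not shown here.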
Takeaways, Limitations
• Takeaways:
◦ An open-source pipeline generates realistic code editing instructions, and fine-tuning on the resulting data improves benchmark performance.
◦ Demonstrates the competitiveness of open-source models by achieving performance close to GPT-4 without proprietary resources.
◦ Produces both concise ("lazy") and detailed ("descriptive") instructions to cover a variety of situations.
• Limitations:
◦ Further analysis is needed on how effectively the filtering methods ensure data quality and diversity.
◦ The models' generalization ability and applicability to a wider range of code editing tasks need further verification.
◦ Further research is needed to identify the specific factors that contributed to the performance improvement (e.g., specific LLMs, fine-tuning strategies).