Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs

Created by
  • Haebom

Authors

Baolong Bi, Shenghua Liu, Xingzhang Ren, Dayiheng Liu, Junyang Lin, Yiwei Wang, Lingrui Mei, Junfeng Fang, Jiafeng Guo, Xueqi Cheng

Outline

In this paper, the authors propose RefineX, a novel framework for refining pre-training data at scale that addresses the trade-off between efficiency and accuracy, motivated by the fact that the foundational performance of large language models (LLMs) depends heavily on the quality of the pre-training corpus. Unlike conventional document-level filtering methods, RefineX performs fine-grained refinement through programmatic editing operations. It trains an efficient and reliable refinement model via a high-precision distillation pipeline that distills refinement results obtained under high-quality expert guidance into minimal edit-based deletion programs. Experiments on models of various sizes show that RefineX consistently outperforms models trained on raw, filtered, or otherwise refined data across diverse downstream tasks. In particular, a 750M model achieves average gains of 2.6%-7.2% on lighteval tasks while reaching comparable performance with significantly fewer training tokens. RefineX also surpasses conventional end-to-end generation approaches and Prox-C in both efficiency and accuracy.
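To make the programmatic-editing idea concrete, here is a minimal Python sketch of what an edit-based deletion program might look like. The operation names (DeleteLines, DeleteSpan) and the program format are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of RefineX-style edit-based deletion programs.
# The operation set and program format below are assumptions for
# illustration, not the paper's actual implementation.

from dataclasses import dataclass

@dataclass
class DeleteLines:
    """Remove whole lines (e.g., navigation menus, boilerplate)."""
    line_indices: list

@dataclass
class DeleteSpan:
    """Remove a character span within one line (e.g., trailing ad text)."""
    line: int
    start: int
    end: int

def apply_program(text: str, program: list) -> str:
    """Apply a sequence of deletion operations to a raw document."""
    lines = text.splitlines()
    for op in program:
        if isinstance(op, DeleteSpan):
            line = lines[op.line]
            lines[op.line] = line[:op.start] + line[op.end:]
        elif isinstance(op, DeleteLines):
            for i in op.line_indices:
                lines[i] = None  # mark for removal
    return "\n".join(l for l in lines if l is not None)

raw = "Home | About | Login\nTransformers use self-attention.\nClick here to subscribe!"
program = [DeleteLines([0, 2])]  # a small refinement model would predict this program
print(apply_program(raw, program))
# -> "Transformers use self-attention."
```

Restricting programs to deletions means the refined text is always a subset of the original, which is the reliability property the summary above attributes to RefineX's minimal edit-based design.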

Takeaways, Limitations

Takeaways:
  • Presents a novel method to efficiently and accurately improve the quality of pre-training data for large language models.
  • Overcomes the limitations of existing document-level filtering and enables fine-grained data refinement.
  • Achieves strong performance even with fewer training tokens.
  • Shows consistent performance improvements across diverse downstream tasks.
Limitations:
  • Performance evaluation may be lacking for tasks beyond the lighteval tasks presented in the paper.
  • The programmatic editing operations in RefineX may not be explained in sufficient detail.
  • Further research may be needed on generalization across different languages and domains.
  • The objectivity and reproducibility of the refinement results obtained under expert guidance may need review.