The foundational performance of large language models (LLMs) depends heavily on the quality of the pre-training corpus. In this paper, we propose RefineX, a novel pre-training data refinement framework that addresses the trade-off between efficiency and accuracy. Unlike conventional document-level filtering methods, RefineX performs fine-grained data cleaning through programmatic editing operations. It trains efficient and reliable refinement models via a high-precision distillation pipeline that distills cleaning decisions obtained under high-quality expert guidance into minimal editing-based deletion programs. Experiments on models of various sizes show that RefineX consistently outperforms models trained on raw data, filtered data, or data cleaned by other methods across a variety of downstream tasks. In particular, a 750M model achieves an average performance improvement of 2.6%-7.2% on lighteval tasks, while reaching comparable performance with significantly fewer training tokens. RefineX outperforms both conventional end-to-end generation and Prox-C in efficiency and accuracy.
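To make the idea of "editing-based deletion programs" concrete, the following is a minimal sketch, not the authors' implementation: a cleaning model emits a small program of deletion operations, which is then applied to the raw document. The function name and operation schema (`delete_line`, `delete_substring`) are illustrative assumptions; the key property is that restricting edits to deletions lets the model remove noise without ever introducing hallucinated content.

```python
def apply_deletion_program(text: str, program: list[dict]) -> str:
    """Apply a list of deletion-only edit operations to a document.

    Hypothetical operation schema (an assumption, not the paper's format):
      {"op": "delete_line", "lines": [i, ...]}          - drop whole lines
      {"op": "delete_substring", "line": i, "text": s}  - strip a substring
    """
    lines = text.split("\n")
    drop = {i for op in program if op["op"] == "delete_line" for i in op["lines"]}
    kept = []
    for i, line in enumerate(lines):
        if i in drop:
            continue  # whole-line deletion (e.g. boilerplate)
        for op in program:
            if op["op"] == "delete_substring" and op.get("line") == i:
                line = line.replace(op["text"], "")  # inline noise removal
        kept.append(line)
    return "\n".join(kept)


raw = "Useful sentence one.\nCLICK HERE to subscribe!\nUseful sentence two. [ad]"
program = [
    {"op": "delete_line", "lines": [1]},
    {"op": "delete_substring", "line": 2, "text": " [ad]"},
]
print(apply_deletion_program(raw, program))
# Useful sentence one.
# Useful sentence two.
```

Because every operation only removes text, the cleaned output is always a subsequence of the input, which is what makes distilled deletion programs far cheaper and safer to apply at corpus scale than free-form end-to-end rewriting.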