Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning

Created by
  • Haebom

Author

Zinan Tang, Xin Gao, Qizhi Pei, Zhuoshi Pan, Mengzhang Cai, Jiang Wu, Conghui He, Lijun Wu

Middo: Self-Evolving Model-Informed Dynamic Data Optimization for LLMs

Outline

This paper introduces Middo, a framework that dynamically optimizes training data for supervised fine-tuning (SFT) of large language models (LLMs). Middo continuously evolves the data to improve model performance through model-aware data selection and context-preserving data refinement. It identifies problematic samples via three model signals (loss patterns, embedding cluster dynamics, and self-alignment scores) and refines them with an adaptive optimization engine while preserving their semantic integrity. The framework presents a new paradigm for sustainable LLM training through the dynamic co-evolution of data and model. Experiments show that Middo achieves an average accuracy improvement of 7.15% while keeping the original dataset size unchanged.
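The three-signal selection step might be sketched as follows. This is a minimal illustration, not the paper's actual method: the thresholds, the centroid-distance proxy for "embedding cluster dynamics", and the function name are all assumptions made for the sketch.

```python
import numpy as np

def select_samples_for_refinement(losses, embeddings, alignment_scores,
                                  loss_z=1.5, dist_z=1.5, align_min=0.5):
    """Flag training samples for refinement using three model signals.

    All thresholds are illustrative, not values from the paper.
    """
    losses = np.asarray(losses, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)
    alignment_scores = np.asarray(alignment_scores, dtype=float)

    # Signal 1: loss pattern -- loss unusually high relative to the dataset.
    high_loss = losses > losses.mean() + loss_z * losses.std()

    # Signal 2: embedding cluster dynamics -- here simplified to distance
    # from the embedding centroid (a crude outlier proxy).
    centroid = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    outlier = dists > dists.mean() + dist_z * dists.std()

    # Signal 3: self-alignment score below a minimum threshold.
    low_align = alignment_scores < align_min

    # A sample flagged by any signal is sent to the refinement engine.
    return np.where(high_loss | outlier | low_align)[0]
```

In the full closed loop described above, the flagged samples would then be rewritten (preserving their semantics) and fed back into the next fine-tuning round.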

Takeaways, Limitations

Takeaways:
Presents a novel approach to dynamically improving the quality of LLM training data.
Builds a framework that continuously optimizes data to match the model's evolving capability.
Assesses and improves data quality by leveraging multiple model signals.
Demonstrates the potential of sustainable LLM training through the co-evolution of data and models.
Achieves an average accuracy improvement of 7.15% while maintaining the original dataset size.
Releases data, models, and code as open source.
Limitations:
Lacks detailed information on the specific experimental benchmarks and model architectures.
Lacks detailed comparative analysis against other data optimization methods.
The framework's generalizability and applicability to diverse LLM architectures remain to be verified.
Lacks information on practical considerations such as computational cost and training time.