Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization

Created by
  • Haebom

Authors

Shengzhuang Chen, Xu Ouyang, Michael Arthur Leopold Pearce, Thomas Hartvigsen, Jonathan Richard Schwarz

Outline

This paper addresses the problem of choosing optimal data mixing ratios for training large language models. Rather than relying on conventional heuristic search, we treat the selection of mixing ratios as a black-box hyperparameter optimization problem and solve it with Bayesian optimization. We systematically study how mixtures found in small-scale experiments transfer to large-scale training, and use multi-fidelity Bayesian optimization to trade off experimental cost against model performance. Pretraining and instruction fine-tuning experiments on models ranging from 1 million to 7 billion parameters, evaluated across a range of benchmarks, demonstrate speedups of up to 500% over existing methods. We also release the ADMIRE IFT Runs dataset, containing 460 full training and evaluation runs across various model sizes, to facilitate further research.
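The core idea can be illustrated with a short sketch (not the authors' implementation): the mixture weights are treated as a black-box hyperparameter vector, a cheap proxy training run supplies the objective value, and a Gaussian-process-based Bayesian optimizer proposes the next mixture to try. The sketch assumes scikit-optimize is available; train_and_evaluate, the domain names, and the synthetic loss are hypothetical placeholders.

```python
# Minimal sketch of data-mixture selection as black-box Bayesian optimization.
# Not the paper's implementation: scikit-optimize, the domain list, and the
# synthetic stand-in objective are assumptions made for illustration only.
import numpy as np
from skopt import gp_minimize
from skopt.space import Real

DOMAINS = ["web", "code", "papers", "dialogue"]  # hypothetical data sources

def train_and_evaluate(mixture):
    """Placeholder for a small proxy training run at the given mixture.

    In practice this would train a small model on data sampled according to
    `mixture` and return its validation loss. Here a synthetic quadratic loss
    with an arbitrary optimum keeps the sketch runnable end to end.
    """
    target = np.array([0.5, 0.2, 0.2, 0.1])
    w = np.array([mixture[d] for d in DOMAINS])
    return float(np.sum((w - target) ** 2))

def objective(raw_weights):
    # Map the unconstrained, positive search vector onto the probability simplex.
    w = np.asarray(raw_weights, dtype=float)
    mixture = dict(zip(DOMAINS, w / w.sum()))
    return train_and_evaluate(mixture)

# One positive weight per domain; the normalization above enforces sum-to-one.
space = [Real(1e-3, 1.0, name=d) for d in DOMAINS]

result = gp_minimize(objective, space, n_calls=30, random_state=0)
best = np.asarray(result.x) / np.sum(result.x)
print({d: round(w, 3) for d, w in zip(DOMAINS, best)})
```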

Takeaways, Limitations

Takeaways:
We propose a Bayesian-optimization-based method for selecting data mixing ratios that significantly improves the efficiency of large language model training.
We propose a method to effectively transfer the results of small-scale experiments to large-scale experiments.
We show that multi-fidelity Bayesian optimization can effectively control the trade-off between experimental cost and model performance (a simplified sketch of this idea appears after these lists).
The release of the ADMIRE IFT Runs dataset lowers the barrier to entry for related research.
The method achieves consistently strong performance across a variety of model sizes and architectures.
Limitations:
Further verification of the generalization performance of the proposed method is required.
Dependence on specific datasets and model architectures cannot be completely ruled out.
The computational cost of Bayesian optimization can still be significant.
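The multi-fidelity aspect can be illustrated with a heavily simplified sketch: cheap small-model runs screen many candidate mixtures, and only the most promising ones are re-evaluated with an expensive large-model run. The function names, costs, and losses below are hypothetical placeholders; the paper's actual multi-fidelity Bayesian optimization is more sophisticated than this two-stage filter.

```python
# Heavily simplified illustration of the cost/performance trade-off behind
# multi-fidelity search: screen mixtures cheaply at small scale, then spend the
# expensive large-scale budget only on the best candidates. Not the paper's
# algorithm; all names and losses are illustrative placeholders.
import random

def small_scale_loss(mixture):
    """Cheap low-fidelity evaluation (e.g. a tiny proxy model). Placeholder."""
    return sum((w - t) ** 2 for w, t in zip(mixture, (0.5, 0.2, 0.2, 0.1)))

def large_scale_loss(mixture):
    """Expensive high-fidelity evaluation (e.g. the full-size model).
    Placeholder: the low-fidelity loss plus noise, mimicking imperfect
    small-to-large transfer."""
    return small_scale_loss(mixture) + random.gauss(0.0, 0.01)

def random_mixture(k=4):
    w = [random.random() for _ in range(k)]
    s = sum(w)
    return tuple(x / s for x in w)

random.seed(0)
candidates = [random_mixture() for _ in range(64)]       # many cheap runs
screened = sorted(candidates, key=small_scale_loss)[:4]  # keep the top few
best = min(screened, key=large_scale_loss)               # few expensive runs
print([round(w, 3) for w in best])
```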