Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

FlexOlmo: Open Language Models for Flexible Data Use

Created by
  • Haebom

Authors

Weijia Shi, Akshita Bhagia, Kevin Farhat, Niklas Muennighoff, Pete Walsh, Jacob Morrison, Dustin Schwenk, Shayne Longpre, Jake Poznanski, Allyson Ettinger, Daogao Liu, Margaret Li, Dirk Groeneveld, Mike Lewis, Wen-tau Yih, Luca Soldaini, Kyle Lo, Noah A. Smith, Luke Zettlemoyer, Pang Wei Koh, Hannaneh Hajishirzi, Ali Farhadi, Sewon Min

Overview

FlexOlmo is a new class of language models that supports distributed training without data sharing and data-flexible inference. It uses a mixture-of-experts (MoE) architecture in which each expert is trained independently on a closed dataset and later integrated through a novel domain-based routing method. Using the FlexMix corpus, which consists of public datasets and seven domain-specific closed datasets, the authors trained models with up to 37 billion parameters (20 billion active) and evaluated them on 31 downstream tasks. Combining a general expert trained on public data with independently trained experts yields an average relative improvement of 41%, while still allowing specific data to be excluded at inference time based on data licensing or permission requirements. The approach also outperforms existing model merging methods by 10.1% on average and surpasses a standard MoE trained with the same training FLOPs. FlexOlmo thus offers a solution for both data owners and researchers in regulated industries with sensitive or protected data: data owners can benefit from their closed data while keeping it local and retaining fine-grained control over data access during inference.
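The opt-out behavior described above can be illustrated with a small sketch. This summary does not specify how the router works, so everything here is an assumption made for illustration: the per-expert router vectors, the dot-product scoring, the softmax over active experts, and all names (`route`, `moe_forward`, the toy experts) are hypothetical and not FlexOlmo's actual implementation. The point is only that excluding an expert at inference amounts to dropping it from the routing step, with no retraining.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(token, router_embeddings, active):
    """Return {expert_name: weight}, computed over the active experts only.

    Assumption: each expert has one router vector, and the routing score
    is the dot product between the token and that vector.
    """
    names = [n for n in router_embeddings if n in active]
    if not names:
        raise ValueError("at least one expert must be active")
    scores = [sum(t * r for t, r in zip(token, router_embeddings[n]))
              for n in names]
    return dict(zip(names, softmax(scores)))

def moe_forward(token, experts, router_embeddings, active):
    """Mix the outputs of the active experts using the routing weights."""
    weights = route(token, router_embeddings, active)
    out = [0.0] * len(token)
    for name, w in weights.items():
        y = experts[name](token)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, weights

# Toy stand-ins for independently trained experts (hypothetical).
experts = {
    "public": lambda x: [v * 1.0 for v in x],
    "news": lambda x: [v * 2.0 for v in x],
    "code": lambda x: [v * 3.0 for v in x],
}
router_embeddings = {
    "public": [0.5, 0.5],
    "news": [1.0, 0.0],
    "code": [0.0, 1.0],
}
token = [1.0, 0.0]

# All experts available.
_, all_weights = moe_forward(
    token, experts, router_embeddings, {"public", "news", "code"})

# The "news" data owner opts out: that expert is simply left out of routing.
_, opt_out_weights = moe_forward(
    token, experts, router_embeddings, {"public", "code"})
```

In this sketch the remaining routing weights are renormalized after an opt-out, so the excluded expert contributes nothing to the output. Whether the real model renormalizes in the same way is not stated in this summary.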

Takeaways, Limitations

Takeaways:
Distributed training enables language models to learn from diverse data sources without data sharing.
Data-flexible inference enables compliance with data licensing and permission requirements.
FlexOlmo outperforms existing model merging methods and a standard MoE.
It offers a practical way to use sensitive data in regulated industries.
Limitations:
Further validation is needed of the composition of the FlexMix corpus and of how well it represents closed datasets.
Generalization performance still needs to be evaluated on closed datasets of various sizes.
Further research is needed on the efficiency and scalability of domain-based routing.
Application and performance evaluation in real industrial settings are still needed.