Daily Arxiv

This page curates AI-related papers published worldwide.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

Created by
  • Haebom

Author

Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada, Negar Foroutan, Skander Moalla, Tiancheng Chen, Vinko Sabolčec, Yixuan Xu, Michael Aerni, Badr AlKhamissi, Ines Altemir Marinas, Mohammad Hossein Amani, Matin Ansaripour, Ilia Badanin, Harold Benoit, Emanuela Boros, Nicholas Browning, Fabian Bosch, Maximilian Böther, Niklas Canova, Camille Challier, Clement Charmillot, Jonathan Coles, Jan Deriu, Arnout Devos, Lukas Drescher, Daniil Dzenhaliou, Maud Ehrmann, Dongyang Fan, Simin Fan, Silin Gao, Miguel Gila, María Grandury, Diba Hashemi, Alexander Hoyle, Jiaming Jiang, Mark Klein, Andrei Kucharavy, Anastasiia Kucherenko, Frederike Lübeck, Roman Machacek, Theofilos Manitaras, Andreas Marfurt, Kyle Matoba, Simon Matrenok, Henrique Mendonça, Fawzi Roberto Mohamed, Syrielle Montariol, Luca Mouchel, Sven Najem-Meyer, Jingwei Ni, Gennaro Oliva, Matteo Pagliardini, Elia Palme, Andrei Panferov, Leo Paoletti, Marco Passerini, Ivan Pavlov, Auguste Poiroux, Kaustubh Ponkshe, Nathan Ranchin, Javi Rando, Mathieu Sauser, Jakhongir Saydaliev, Muhammad Ali Sayfiddinov, Marian Schneider, Stefano Schuppli, Marco Scialanga, Andrei Semenov, Kumar Shridhar, Raghav Singhal, Anna Sotnikova, Alexander Sternfeld, Ayush Kumar Tarun, Paul Teiletche, Jannis Vamvas, Xiaozhe Yao, Hao Zhao, Alexander Ilic, Ana Klimovic, Andreas Krause, Caglar Gulcehre, David Rosenthal, Elliott Ash, Florian Tramèr, Joost VandeVondele, Livio Veraldi, Martin Rajman, Thomas Schulthess, Torsten Hoefler, Antoine Bosselut, Martin Jaggi, Imanol Schlag

Outline

Apertus is a suite of fully open large language models (LLMs) designed to address two systemic shortcomings of today's open model ecosystem: data compliance and multilingual representation. Unlike many existing models that release only weights, without a reproducible data pipeline or regard for content-owner rights, the Apertus models are pre-trained exclusively on openly available data, retroactively applying robots.txt exclusions and filtering out non-permissive, harmful, and personally identifiable content. To mitigate the risk of memorization, the Goldfish objective is employed during pre-training, suppressing verbatim recall of training data while maintaining downstream task performance. The Apertus models also extend multilingual coverage by training on 15T tokens spanning more than 1,800 languages, with roughly 40% of the pre-training data allocated to non-English content. Released at the 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivaling or surpassing open-weight models. Beyond the model weights, all scientific artifacts from the development cycle, including data preparation scripts, checkpoints, evaluation suites, and training code, are released under a permissive license, enabling transparent auditing and extension.
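To make the memorization-mitigation idea concrete, here is a minimal sketch of a Goldfish-style token-dropping loss, assuming a standard PyTorch next-token-prediction setup. The drop rate `k`, the `context_width`, and the use of Python's built-in `hash` are illustrative assumptions, not the exact recipe used in Apertus.

from: https://arxiv.org/abs/... (see the paper's released training code for the actual implementation)

```python
import torch
import torch.nn.functional as F

def goldfish_loss(logits, labels, k=4, context_width=13):
    """Cross-entropy that drops roughly 1/k of target tokens from the loss.

    A target token is excluded when a hash of the preceding `context_width`
    token ids falls into a fixed bucket, so the same passage always loses
    the same supervision targets and is never fully memorizable verbatim.
    """
    batch, seq_len, vocab = logits.shape
    mask = torch.ones(batch, seq_len, dtype=torch.bool, device=labels.device)
    for b in range(batch):
        for t in range(context_width, seq_len):
            window = tuple(labels[b, t - context_width:t].tolist())
            if hash(window) % k == 0:  # deterministic pseudo-random drop
                mask[b, t] = False
    loss = F.cross_entropy(
        logits.reshape(-1, vocab), labels.reshape(-1), reduction="none"
    ).reshape(batch, seq_len)
    return loss[mask].mean()
```

Because the mask is a deterministic function of local context rather than a random dropout, repeated occurrences of the same passage are always missing the same targets, which is what blocks exact regurgitation while leaving the bulk of the training signal intact.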

Takeaways, Limitations

Takeaways:
Provides a fully open LLM built with data compliance and multilingual coverage in mind.
Demonstrates a concrete effort to address ethical concerns through retroactive robots.txt exclusions and filtering of harmful content (a minimal filtering sketch follows this list).
Reduces memorization risk while maintaining performance via the Goldfish objective.
Achieves near state-of-the-art performance among fully open models on multilingual benchmarks.
Improves transparency and reproducibility by releasing all scientific artifacts, including data scripts, checkpoints, evaluation suites, and training code.
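The released data-preparation scripts contain the actual pipeline; the snippet below is only a hypothetical illustration of what retroactive robots.txt filtering can look like, using Python's standard library. The agent name and corpus layout are made up for the example.

```python
from urllib import robotparser

def allowed_by_robots(url, robots_txt, agent="ExampleCrawler"):
    """Check whether a previously crawled URL is still permitted.

    `robots_txt` is the current robots.txt body for the URL's host,
    fetched separately; `agent` is a placeholder user-agent name.
    """
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

# Example: drop corpus documents whose source URL is now disallowed.
corpus = [
    {"url": "https://example.com/articles/1", "text": "..."},
    {"url": "https://example.com/private/2", "text": "..."},
]
robots = "User-agent: *\nDisallow: /private/\n"
kept = [doc for doc in corpus if allowed_by_robots(doc["url"], robots)]
```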
Limitations:
Further analysis of the effectiveness of the Goldfish objective may be required.
Detailed information on the quality and biases of the training data may be lacking.
Results are reported only on selected multilingual benchmarks, so performance on other benchmarks is unknown.
Models are released only at the 8B and 70B scales, so the behavior of larger models is unknown.