Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs

Created by
  • Haebom

Author

Mehdi Ali, Michael Fromm, Klaudia Thellmann, Jan Ebert, Alexander Arno Weber, Richard Rutmann, Charvi Jain, Max Lübbering, Daniel Steinigen, Johannes Leveling, Katrin Klug, Jasper Schulze Buschhoff, Lena Jurkschat, Hammam Abdelwahab, Benny Jörg Stein, Karl-Heinz Sylla, Pavel Denisov, Nicolò Brandizzi, Qasid Saleem, Anirban Bhowmick, Lennard Helmer, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff, Alex Jude, Lalith Manjunath, Samuel Weinbach, Carolin Penke, Oleg Filatov, Fabio Barth, Paramita Mirza, Lucas Weber, Ines Wendler, Rafet Sifa, Fabian Küch, Andreas Herten, René Jäkel, Georg Rehm, Stefan Kesselheim, Joachim Köhler, Nicolas Flores-Herr

Outline

We present two multilingual large language models (LLMs), Teuken-7B-Base and Teuken-7B-Instruct, designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising approximately 60% non-English data and using a custom multilingual tokenizer, they address the limitations of existing LLMs, which focus primarily on English or a small number of high-resource languages. We detail the principles behind the models' development, including data composition, tokenizer optimization, and training methodology. The models achieve robust results on multilingual benchmarks, as measured by European versions of ARC, HellaSwag, and TruthfulQA.
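One concrete way to see why tokenizer optimization matters for 24 languages is the "fertility" metric commonly used in multilingual tokenizer work: the average number of tokens produced per word, where lower is better. The sketch below is a minimal illustration of that metric only, assuming the Hugging Face transformers library; the sample sentences are made up, and the publicly available gpt2 and xlm-roberta-base tokenizers stand in for an English-centric versus a multilingual tokenizer, since the paper's custom tokenizer is not reproduced here.

```python
# Minimal sketch: compare tokenizer "fertility" (tokens per
# whitespace-separated word) across a few EU languages.
# Lower fertility = the tokenizer segments that language more efficiently.
from transformers import AutoTokenizer

def fertility(tokenizer, text: str) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    tokens = tokenizer.tokenize(text)
    return len(tokens) / len(words)

# Hypothetical sample sentences; real studies use large parallel corpora.
samples = {
    "en": "The committee approved the proposal after a long debate.",
    "de": "Der Ausschuss billigte den Vorschlag nach langer Debatte.",
    "fr": "La commission a approuvé la proposition après un long débat.",
    "pl": "Komisja zatwierdziła wniosek po długiej debacie.",
}

# An English-centric tokenizer (gpt2) versus a multilingual one
# (xlm-roberta-base), used here purely as illustrative stand-ins.
for name in ["gpt2", "xlm-roberta-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    scores = {lang: round(fertility(tok, s), 2) for lang, s in samples.items()}
    print(name, scores)
```

Typically the English-centric tokenizer shows noticeably higher fertility on non-English text, which inflates sequence lengths and training cost; a tokenizer trained on a balanced multilingual corpus, as described in the paper, narrows that gap.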

Takeaways, Limitations

Takeaways: Presents a successful case study of developing a multilingual LLM that supports all official languages of the European Union. By emphasizing non-English training data, the models help address the language bias of existing LLMs. Strong results on multilingual benchmarks demonstrate their practicality.
Limitations: Limited detail on the specific dataset construction and tokenizer optimization process; no per-language performance analysis; no comparative analysis against other multilingual LLMs; little discussion of potential model bias and ethical issues.