Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

The State of Large Language Models for African Languages: Progress and Challenges

Created by
  • Haebom

Author

Kedir Yassin Hussen, Walelign Tewabe Sewunetie, Abinew Ali Ayele, Sukairaj Hafiz Imam, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam

Outline

This paper surveys the applicability of large language models (LLMs) to roughly 2,000 low-resource African languages. It compares six LLMs, eight small language models (SLMs), and six specialized SLMs (SSLMs) to assess the current state of African language support, training datasets, technical limitations, script coverage, and language modeling roadmaps. The analysis shows that although 42 African languages are supported and 23 public datasets exist, more than 98% of African languages remain unsupported, and processing is concentrated on just four languages (Amharic, Swahili, Afrikaans, and Malagasy). Furthermore, only the Latin, Arabic, and Ge'ez scripts are recognized, while 20 actively used scripts are ignored. The major challenges identified are data scarcity, tokenization bias, high computational cost, and evaluation issues.
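
To make the tokenization bias mentioned above concrete, here is a minimal sketch (not from the paper; it assumes the Hugging Face `transformers` library and uses the publicly available GPT-2 tokenizer purely as a familiar example, and the Amharic phrase is a hypothetical sample). Ge'ez-script text is typically split into many byte-level fragments, so the same short greeting tends to cost far more tokens in Amharic than in English.

```python
# Minimal sketch: compare token counts for an English vs. an Amharic phrase.
# Assumptions: `transformers` is installed; GPT-2's tokenizer is used only as
# a well-known example, not one of the models analyzed in the paper.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "English": "Hello, world",
    "Amharic (Ge'ez script)": "ሰላም ለዓለም",  # hypothetical sample phrase
}

for name, text in samples.items():
    tokens = tokenizer.tokenize(text)
    # Each Ge'ez character is usually broken into several byte-level pieces,
    # so the Amharic phrase generally consumes many more tokens.
    print(f"{name}: {len(tokens)} tokens -> {tokens}")
```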

Takeaways, Limitations

Takeaways: Reveals the severe gap in LLM support for low-resource African languages and argues for language standardization, community-driven corpus development, and effective adaptation methods for African languages. Provides an overview of currently supported languages and available datasets, and outlines future research directions.
Limitations: The set of models analyzed is limited and may not fully reflect all African languages and their diversity. The objectivity and generalizability of the evaluation criteria require further review.