Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries here are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Overcoming Data Scarcity in Generative Language Modeling for Low-Resource Languages: A Systematic Review

Created by
  • Haebom

Authors

Josh McGiff, Nikola S. Nikolov

Outline

This paper presents the first systematic review of strategies for addressing data scarcity in generative language modeling for low-resource languages (LRLs). Drawing on 54 studies, we identify, categorize, and evaluate technical approaches, including monolingual data augmentation, back-translation, multilingual training, and prompt engineering, across generative tasks. We also analyze trends in architecture choices, language family representation, and evaluation methods. Our findings highlight a strong reliance on transformer-based models, a concentration on a small number of LRLs, and a lack of consistent evaluation across studies. We conclude with recommendations for extending these methods to a wider range of LRLs and outline the open challenges in building equitable generative language systems. Ultimately, this review aims to support researchers and developers in building inclusive AI tools for low-resource language users.

Takeaways, Limitations

Takeaways: The review systematically categorizes technical approaches to data scarcity in generative language modeling for low-resource languages, evaluates techniques such as multilingual training and data augmentation, and suggests directions for future research. It can contribute to building inclusive AI tools for low-resource language users.
Limitations: The reviewed literature relies heavily on transformer-based models, concentrates on a small number of LRLs, and lacks consistent evaluation criteria across studies. Research covering a wider range of LRLs and generative tasks is needed.