Daily Arxiv

This page curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Sadeed: Advancing Arabic Diacritization Through Small Language Model

Created by
  • Haebom

Authors

Zeina Aldallal, Sara Chrouf, Khalil Hennara, Mohamed Motaism Hamed, Muhammad Hreden, Safwan AlModhayan

Outline

Diacritization of Arabic text remains a persistent challenge in natural language processing due to the rich morphological characteristics of the language. In this paper, we present Sadeed, a decoder-only language model fine-tuned from Kuwain 1.5B (Hennara et al. [2025]), a compact model trained on a diverse Arabic corpus. Sadeed is fine-tuned on carefully selected, high-quality diacritized texts produced through rigorous data cleaning and normalization. Despite using fewer computational resources, Sadeed achieves results competitive with proprietary large-scale language models and outperforms existing models trained in similar domains. Furthermore, this paper highlights key shortcomings in current benchmarking practices for Arabic diacritization. To address these issues, we introduce SadeedDiac-25, a novel benchmark designed to enable fairer and more comprehensive evaluation across a variety of text genres and complexity levels. Sadeed and SadeedDiac-25 provide a solid foundation for advancing Arabic NLP applications, including machine translation, speech synthesis, and language learning tools.
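For readers unfamiliar with the task, the following minimal sketch illustrates what diacritization data and scoring can look like. It is not the authors' pipeline or the SadeedDiac-25 metric: the mark set in `DIACRITICS`, the helper names, and the simple per-character error rate are all assumptions made for illustration.

```python
import re

# Assumed mark set for illustration: tanwin, fatha, damma, kasra, shadda,
# sukun (U+064B..U+0652) plus the superscript alef (U+0670).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Remove diacritic marks, giving the undiacritized model input."""
    return DIACRITICS.sub("", text)


def split_marks(text: str) -> list[tuple[str, str]]:
    """Pair each base character with the diacritics that follow it."""
    pairs: list[tuple[str, str]] = []
    for ch in text:
        if pairs and DIACRITICS.fullmatch(ch):
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def diacritic_error_rate(reference: str, prediction: str) -> float:
    """Fraction of non-space base characters whose marks differ.

    Hypothetical metric for illustration; mark order is ignored so that
    shadda+fatha and fatha+shadda count as equal.
    """
    ref, hyp = split_marks(reference), split_marks(prediction)
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("reference and prediction differ in base characters")
    scored = [(sorted(r), sorted(h))
              for (b, r), (_, h) in zip(ref, hyp) if not b.isspace()]
    if not scored:
        return 0.0
    return sum(r != h for r, h in scored) / len(scored)


if __name__ == "__main__":
    gold = "\u0643\u064E\u062A\u064E\u0628\u064E"   # kataba, fully diacritized
    guess = "\u0643\u064E\u062A\u064F\u0628\u064E"  # one wrong vowel mark
    print(strip_diacritics(gold))
    print(diacritic_error_rate(gold, guess))
```

Stripping the marks from a diacritized sentence yields an (input, target) pair, and the error rate compares a model's output against the target one base character at a time.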

Takeaways, Limitations

Takeaways:
Achieves performance comparable to existing large-scale models with a small model, improving computational efficiency.
Builds a high-quality dataset through rigorous data cleaning and normalization.
Identifies limitations in current benchmarking practices for Arabic diacritization and introduces a new benchmark, SadeedDiac-25.
Contributes to the development of Arabic NLP applications, including machine translation, speech synthesis, and language learning tools.
Limitations:
Although the paper points out the limitations of current benchmarking practices, further research is needed to determine whether SadeedDiac-25 fully addresses them.
Sadeed's performance may be biased toward specific datasets.
The size and diversity of the dataset used are not described in detail.