Daily Arxiv

This page curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation

Created by
  • Haebom

Authors

Mohammed Khalil, Mohammed Sabry

Outline

This paper introduces ATHAR, a high-quality dataset for translating classical Arabic literature into English. The authors stress the importance of classical Arabic literature and the need to translate it, and they note that existing translation datasets for this period are limited in scope and topic coverage. ATHAR comprises 66,000 high-quality translation samples spanning diverse fields, including science, culture, and philosophy. The authors evaluate state-of-the-art large language models (LLMs) on the dataset to demonstrate both the need for it and its applicability. The dataset is publicly available on the HuggingFace Data Hub.

Takeaways, Limitations

Takeaways: ATHAR provides a high-quality dataset for classical Arabic translation research and can help improve the performance of LLM-based translation systems. Its coverage of diverse fields makes classical Arabic literature more accessible and supports the dissemination of knowledge. The dataset may also be used for fine-tuning and pre-training LLMs.
Limitations: The dataset may not yet be large enough, and further analysis may be needed to identify qualitative biases or imbalances in its contents. Extending the translations to languages other than English should also be considered.