Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

Hakim: Farsi Text Embedding Model

Created by
  • Haebom

Author

Mehran Sarmadi, Morteza Alikhani, Erfan Zinvandi, Zahra Pourbahman

Outline

This paper aims to advance Persian text embedding research by presenting Hakim, a novel Persian text embedding model that achieves an 8.5% performance improvement over existing approaches on the FaMTEB benchmark, outperforming prior Persian models. The work introduces three new datasets (Corpesia, Pairsia-sup, and Pairsia-unsup) for supervised and unsupervised learning. Hakim is also designed for retrieval tasks that incorporate chatbot message history and for retrieval-augmented generation (RAG) systems. In addition, a new baseline model based on the BERT architecture is proposed and demonstrates higher accuracy on several Persian NLP tasks, while a RetroMAE-based model proves particularly effective for text retrieval applications.
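Since the paper positions Hakim as a dense embedding model for retrieval and RAG, the following is a minimal sketch of how such a model would typically be used for Persian semantic search with the sentence-transformers library. The model identifier below is a hypothetical placeholder, not the paper's released checkpoint, and the query/document texts are toy examples.

```python
# Minimal dense-retrieval sketch with a Persian text embedding model.
# MODEL_ID is a placeholder; substitute the actual published checkpoint
# for Hakim (or any other Persian embedding model) if/when available.
from sentence_transformers import SentenceTransformer, util

MODEL_ID = "org/persian-embedding-model"  # hypothetical identifier

model = SentenceTransformer(MODEL_ID)

# A toy document collection and a query (Persian strings are handled like any other text).
documents = [
    "تهران پایتخت ایران است.",          # "Tehran is the capital of Iran."
    "زبان فارسی در ایران صحبت می‌شود.",  # "Persian is spoken in Iran."
]
query = "پایتخت ایران کجاست؟"            # "Where is the capital of Iran?"

# Encode query and documents into dense vectors, then rank documents by cosine similarity.
doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]

best = int(scores.argmax())
print(f"Best match (score {float(scores[best]):.3f}): {documents[best]}")
```

In a RAG or chatbot setting, the same encode-and-rank step would be applied to the concatenated message history plus the current query, with the top-ranked passages passed to the generator.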

Takeaways and Limitations

Takeaways:
The Hakim model contributes to the advancement of Persian NLP by outperforming existing models by 8.5% on the FaMTEB benchmark.
The new datasets (Corpesia, Pairsia-sup, Pairsia-unsup) provide a rich resource for training Persian models.
It suggests potential applications in chatbots and RAG systems, and is particularly strong in retrieval tasks that incorporate message history.
The development of a new baseline model based on BERT and a RetroMAE-based model presents a novel approach to various Persian NLP tasks.
Limitations:
No specific limitations are stated; only the summary information is available.