Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

VN-MTEB: Vietnamese Massive Text Embedding Benchmark

Created by
  • Haebom

Authors

Loc Pham, Tung Luu, Thu Vo, Minh Nguyen, Viet Hoang

Outline

This paper presents VN-MTEB, a large-scale benchmark for evaluating Vietnamese text embedding models. Vietnam's high internet usage and the prevalence of online toxicity make embedding models important for Vietnamese applications, yet no suitable evaluation dataset has been available. To address this gap, we translated the existing English Massive Text Embedding Benchmark (MTEB) into Vietnamese. The translation and filtering pipeline combines large language models (LLMs) with state-of-the-art embedding models to keep the Vietnamese text natural and semantically faithful to the source, while also preserving named entities (NER) and code fragments. The resulting benchmark, VN-MTEB, comprises 41 datasets across six tasks. Our analysis shows that larger, more complex models using Rotary Positional Embeddings outperform models using Absolute Positional Embeddings. The dataset is publicly available on HuggingFace.
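The filtering step described above can be pictured with a minimal sketch. It keeps a translated sentence only when its embedding stays close to the embedding of the English source. This is an illustration under stated assumptions, not the paper's actual pipeline: the function names, the threshold of 0.8, and the toy lookup-table embedder are all hypothetical stand-ins for a real multilingual embedding model.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_translations(
    pairs: List[Tuple[str, str]],
    embed: Callable[[str], Vector],
    threshold: float = 0.8,  # hypothetical cutoff, not a value from the paper
) -> List[Tuple[str, str, float]]:
    """Keep (source, translation, score) triples whose similarity meets the threshold."""
    kept = []
    for source, translation in pairs:
        score = cosine_similarity(embed(source), embed(translation))
        if score >= threshold:
            kept.append((source, translation, score))
    return kept


# Toy stand-in for a multilingual embedding model (assumption: a real model
# would map sentences with the same meaning to nearby vectors).
TOY_VECTORS = {
    "The cat sleeps.": [1.0, 0.0, 0.2],
    "Con mèo đang ngủ.": [0.9, 0.1, 0.2],   # faithful translation
    "Hôm nay trời mưa.": [0.0, 1.0, 0.1],   # unrelated sentence
}


def toy_embed(text: str) -> Vector:
    return TOY_VECTORS[text]


pairs = [
    ("The cat sleeps.", "Con mèo đang ngủ."),
    ("The cat sleeps.", "Hôm nay trời mưa."),
]
kept = filter_translations(pairs, toy_embed, threshold=0.8)
for source, translation, score in kept:
    print(f"{source} -> {translation} (similarity {score:.3f})")
```

In this toy run only the faithful pair survives the filter. With a real embedding model the threshold would need to be tuned on held-out data, since similarity scores are not comparable across models.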

Takeaways, Limitations

Takeaways:
We provide the first large-scale, diverse benchmark dataset for evaluating Vietnamese embedding models.
We present an effective dataset construction method that uses LLMs and state-of-the-art embedding models for translation and filtering.
The benchmark provides evaluation criteria that can guide improvements to Vietnamese embedding models.
We empirically show that models using Rotary Positional Embeddings outperform models using Absolute Positional Embeddings on this benchmark.
Limitations:
Because VN-MTEB is built by translating MTEB, biases in the original datasets may carry over to it.
Additional verification may be required to catch semantic loss or errors introduced during translation.
The benchmark may not fully reflect the dialects and lexical variation of Vietnamese.
Caution is needed in generalizing, as results may vary depending on the type and parameters of the model used for evaluation.