Daily Arxiv

This page collects and organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please cite the source when sharing.

SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?

Created by
  • Haebom

Authors

Senyu Li, Jiayi Wang, Felermino DMA Ali, Colin Cherry, Daniel Deutsch, Eleftheria Briakou, Rui Sousa-Silva, Henrique Lopes Cardoso, Pontus Stenetorp, David Ifeoluwa Adelani

Outline

To address the challenges of machine translation (MT) quality assessment for low-resource African languages, this study introduces SSA-MTE, a large-scale human-annotated MT evaluation dataset covering 14 African language pairs with over 73,000 sentence-level annotations from the news domain. Building on this dataset, the authors develop improved reference-based and reference-free evaluation metrics, SSA-COMET and SSA-COMET-QE, and also benchmark prompt-based approaches using state-of-the-art LLMs such as GPT-4o, Claude-3.7, and Gemini 2.5 Pro. Experimental results show that the SSA-COMET models significantly outperform AfriCOMET and are competitive with Gemini 2.5 Pro, particularly for low-resource languages such as Twi, Luo, and Yoruba. All resources used in this study are released under an open license.
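As a rough illustration of how COMET-style metrics like SSA-COMET and SSA-COMET-QE are typically scored, the sketch below uses the standard unbabel-comet toolkit. The checkpoint name shown is a public stand-in (Unbabel/wmt22-comet-da), not the SSA-COMET release, and the assumption that the released models follow this interface is mine, not the authors'.

```python
# Minimal sketch of COMET-style segment scoring with the unbabel-comet toolkit.
# Assumption: the released SSA-COMET / SSA-COMET-QE checkpoints follow the standard
# COMET interface; the checkpoint below is a public placeholder, not the SSA-COMET release.
from comet import download_model, load_from_checkpoint

# Reference-based scoring (SSA-COMET-style): needs source, hypothesis, and reference.
model_path = download_model("Unbabel/wmt22-comet-da")  # substitute the SSA-COMET checkpoint
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "source sentence in the source language",
        "mt": "machine-translated hypothesis",
        "ref": "human reference translation",
    }
]
output = model.predict(data, batch_size=8, gpus=0)  # set gpus=1 if a GPU is available
print(output.scores)        # one quality score per segment
print(output.system_score)  # corpus-level average

# Reference-free scoring (SSA-COMET-QE-style): load a QE checkpoint instead and
# drop the "ref" field from each item.
qe_data = [{"src": d["src"], "mt": d["mt"]} for d in data]
```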

Takeaways, Limitations

Takeaways:
• Built SSA-MTE, a large-scale human-annotated dataset, advancing MT evaluation research for African languages.
• Developed improved evaluation metrics, SSA-COMET and SSA-COMET-QE.
• Benchmarked prompt-based LLM approaches (GPT-4o, Claude-3.7, Gemini 2.5 Pro) and compared them against SSA-COMET (a generic prompt sketch appears at the end of this section).
• Demonstrated strong SSA-COMET performance on low-resource languages such as Twi, Luo, and Yoruba.
• Released all resources under an open license, facilitating follow-up research.
Limitations:
• The dataset is limited to the news domain.
• The SSA-COMET models are only competitive with, not clearly superior to, the strongest LLM evaluated (Gemini 2.5 Pro).
• The prompt-based baselines depend on specific proprietary LLMs.
• Extension to more language pairs and domains is left for future work.
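Since the prompt-based LLM baselines are mentioned without their prompts being reproduced here, the following is only a generic direct-assessment sketch: the prompt wording, the 0-100 scale, and the use of the OpenAI Python client for GPT-4o are illustrative assumptions, not the paper's protocol.

```python
# Generic sketch of prompt-based MT quality assessment with an LLM.
# Assumptions: the prompt wording, 0-100 direct-assessment scale, and model choice
# below are illustrative only and are not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def llm_quality_score(src: str, mt: str, ref: str | None = None) -> str:
    """Ask an LLM for a 0-100 translation quality score (reference-free if ref is None)."""
    parts = [
        "Rate the quality of the machine translation on a scale of 0 (worst) to 100 (best).",
        f"Source: {src}",
        f"Translation: {mt}",
    ]
    if ref is not None:
        parts.append(f"Reference: {ref}")
    parts.append("Answer with the number only.")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "\n".join(parts)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()


print(llm_quality_score("source sentence", "machine-translated hypothesis"))
```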