Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages

Created by
  • Haebom

Author

Raphael Merx, Hanna Suominen, Trevor Cohn, Ekaterina Vylomova

Outline

This paper introduces OpenWHO, a document-level parallel corpus that addresses the lack of machine translation (MT) evaluation datasets for low-resource languages, particularly in the healthcare domain. The corpus consists of expert-authored, professionally translated materials from the World Health Organization's e-learning platform. It contains 2,978 documents and 26,824 sentences in over 20 languages, nine of which are low-resource. Using this new resource, the authors evaluate state-of-the-art large language models (LLMs) against traditional MT models. The results show that LLMs consistently outperform traditional MT models, with Gemini 2.5 Flash achieving a 4.79 chrF-point improvement over NLLB-54B on the low-resource test set. The authors also investigate how context affects LLM translation accuracy, demonstrating clear benefits from document-level translation in a specialized field such as healthcare. The OpenWHO corpus has been made publicly available to encourage low-resource MT research in the health domain.
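The headline result is reported in chrF, a character n-gram F-score commonly used for morphologically rich, low-resource languages. As a rough illustration of what the metric measures (this is a minimal sketch of the standard chrF formulation, not the paper's evaluation code; in practice one would use a library such as sacrebleu):

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams, ignoring whitespace (common chrF convention)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Character n-gram F-score on a 0-100 scale.

    Averages n-gram precision and recall over orders 1..max_n,
    then combines them with an F-beta (beta=2 weights recall higher,
    as in the original chrF definition).
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # sentence too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / hyp_total)
        recalls.append(overlap / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

An identical hypothesis and reference score 100, and scores degrade gracefully with partial character-level overlap, which is why chrF is preferred over word-level BLEU for languages with rich morphology.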

Takeaways, Limitations

Takeaways:
OpenWHO is a new dataset for health-domain MT research in low-resource languages.
LLMs outperform traditional MT models in low-resource settings.
Document-level translation improves LLM performance in a specialized domain (health).
Releasing the dataset as an open resource encourages further research by the community.
Limitations:
The paper does not explicitly discuss its limitations.