Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Retrieval-Augmented Machine Translation with Unstructured Knowledge

Created by
  • Haebom

Authors

Jiaan Wang, Fandong Meng, Yingxue Zhang, Jie Zhou

Outline

This paper studies retrieval-augmented machine translation (RAG-MT) using unstructured documents. Whereas previous research has mainly improved LLM translation performance by retrieving information from parallel machine translation corpora or knowledge graphs, this paper focuses on leveraging the vast world knowledge available in unstructured documents across many languages. To this end, the researchers built a new benchmark, RAGtrans, consisting of 169,000 machine translation samples paired with multilingual documents, constructed using GPT-4 and human translators. They further propose a multi-task learning method that trains LLMs to exploit information from existing multilingual corpora without requiring additional labeling. Experiments show that the proposed method significantly improves BLEU and COMET scores on English-Chinese and English-German translation. Finally, the authors analyze the challenges current LLMs face on these tasks.
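The core RAG-MT setup can be illustrated with a minimal sketch: score unstructured documents against the source sentence, then pack the top-k into a translation prompt for an LLM. The helper names and the word-overlap retriever below are illustrative assumptions, not the paper's actual retriever or prompt format.

```python
# Minimal RAG-MT sketch (hypothetical helpers; the paper's retriever and
# prompts differ). Documents are ranked by simple word overlap with the
# source sentence, and the top-k are placed into a translation prompt.

def retrieve(source: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the source sentence."""
    src_words = set(source.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(src_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(source: str, documents: list[str], tgt_lang: str = "Chinese") -> str:
    """Assemble a prompt that conditions the LLM on retrieved knowledge."""
    context = "\n".join(f"[Doc {i + 1}] {d}" for i, d in enumerate(documents))
    return (
        "Use the following background documents when translating.\n"
        f"{context}\n"
        f"Translate into {tgt_lang}: {source}"
    )

docs = [
    "The Large Hadron Collider is a particle accelerator at CERN.",
    "BLEU is an automatic metric for machine translation quality.",
    "CERN is based near Geneva on the Franco-Swiss border.",
]
source = "The Large Hadron Collider restarted at CERN."
top = retrieve(source, docs)
prompt = build_prompt(source, top)
print(prompt)
```

In practice the overlap scorer would be replaced by a dense or BM25 retriever, and the prompt would be sent to the LLM being trained or evaluated; the sketch only shows how unstructured knowledge is injected into the translation input.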

Takeaways, Limitations

Takeaways:
We present a new benchmark, RAGtrans, demonstrating the potential of RAG-MT using unstructured documents.
An effective multi-task learning method for leveraging multilingual document information without additional labeling is proposed.
The method yields significant improvements in BLEU and COMET scores on English-Chinese and English-German translation.
Provides an analysis of the challenges currently faced by LLMs in RAG-MT.
Limitations:
The scale of the RAGtrans benchmark needs to be further expanded.
Further research is needed on the generalization performance of the proposed multi-task learning method.
Experiments are limited to a small number of language pairs (English-Chinese and English-German).
A more detailed analysis of the challenges LLMs face in RAG-MT is needed.