
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation

Created by
  • Haebom

Authors

Maria Andrea Cruz Blandon, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico

Outline

This paper presents MEMERAG, a Multilingual End-to-End Meta-Evaluation benchmark for Retrieval Augmented Generation. Existing automated RAG evaluation benchmarks are limited in that they are English-centric or rely on translated data, which fails to properly capture cultural and linguistic nuances. MEMERAG is built on the MIRACL dataset: native-language questions in each language are answered by multiple large language models (LLMs), and the responses are then annotated by experts for faithfulness and relevance. The paper describes the annotation process, reports high inter-annotator agreement, analyzes LLM performance across languages, and benchmarks multilingual automated evaluators (LLM-as-a-judge). The authors show that the benchmark can reliably identify gains from improved prompting techniques and stronger LLMs, and the dataset is publicly available on GitHub.
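To make the LLM-as-a-judge setup concrete, below is a minimal sketch of how an automated evaluator could label a single RAG response for faithfulness and relevance. This is not the authors' code: the judge model name, prompt wording, and label scheme are assumptions for illustration, using the OpenAI chat API.

```python
# Minimal LLM-as-a-judge sketch (illustrative only; not the MEMERAG authors' code).
# The model name, prompt text, and yes/partial/no label scheme are placeholders.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a RAG system's answer.
Question ({lang}): {question}
Retrieved passages: {passages}
Answer: {answer}

Rate the answer on two criteria:
1. Faithfulness: is every claim supported by the passages? (yes/partial/no)
2. Relevance: does the answer address the question? (yes/partial/no)
Reply as: faithfulness=<label>, relevance=<label>"""

def judge_answer(question: str, passages: str, answer: str, lang: str) -> str:
    """Ask the judge model to label one (question, passages, answer) triple."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            lang=lang, question=question, passages=passages, answer=answer)}],
        temperature=0.0,
    )
    return response.choices[0].message.content
```

In a benchmark like this, the judge's labels would be collected over all native-language questions and generating LLMs, then compared against the expert annotations.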

Takeaways, Limitations

Takeaways:
A new benchmark (MEMERAG) for automated evaluation of multilingual RAG systems is presented.
Enables more realistic evaluation of RAG systems that takes cultural nuances into account.
Shows that gains from improved prompting techniques and stronger LLMs can be reliably identified (see the agreement sketch after this section).
Comparative analysis of LLM performance across multiple languages.
Contributes to future research through the publicly released dataset.
Limitations:
Since it is based on the MIRACL dataset, limitations of the dataset may also affect MEMERAG.
Because it relies on expert annotations, annotation costs and time can be high.
Evaluation results may be limited to the specific LLMs and prompting techniques tested.
Important aspects beyond the evaluation criteria (faithfulness and relevance) may not have been considered.
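The meta-evaluation step that the benchmark enables is essentially a comparison between an automated evaluator's labels and the expert annotations. The sketch below uses made-up labels and standard scikit-learn metrics as an assumed stand-in; MEMERAG's actual label set and reported metrics may differ.

```python
# Hypothetical meta-evaluation sketch: how well does an LLM judge agree with experts?
# Labels below are invented for illustration.
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Expert (human) labels and judge labels for the same responses, one criterion shown.
human_faithfulness = ["yes", "no", "partial", "yes", "yes", "no"]
judge_faithfulness = ["yes", "no", "yes", "yes", "partial", "no"]

acc = accuracy_score(human_faithfulness, judge_faithfulness)
kappa = cohen_kappa_score(human_faithfulness, judge_faithfulness)
print(f"faithfulness: agreement={acc:.2f}, Cohen's kappa={kappa:.2f}")

# A judge model (or prompting technique) with higher agreement against the expert
# labels is the one such a benchmark would rank higher.
```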