Daily Arxiv

This page curates AI-related papers published worldwide.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs

Created by
  • Haebom

Authors

Hao Wang, Pinzhi Huang, Jihan Yang, Saining Xie, Daisuke Kawahara

Outline

This paper highlights that despite rapid progress in the practical deployment of multimodal large language models (MLLMs), achieving consistent performance across languages remains a significant challenge, especially when cultural knowledge is involved. To assess this issue, the researchers present two new benchmarks: KnowRecall and VisRecall. KnowRecall is a visual question answering benchmark that measures the consistency of factual knowledge across 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by having the model describe the appearance of landmarks in nine languages without access to the images. Experiments show that even state-of-the-art MLLMs, including proprietary models, struggle to achieve cross-lingual consistency, underscoring the need for more robust approaches to building truly multilingual and culturally aware models.
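
The evaluation idea can be pictured as posing the same factual question about a landmark in every language and checking whether the model's answer holds up across all of them. The paper's actual scoring code is not reproduced here; the sketch below is a hypothetical Python illustration of such a cross-lingual consistency metric, where `answers`, `gold`, and the language list are assumed placeholder structures rather than the benchmark's real data format.

```python
# Hypothetical sketch of a KnowRecall-style consistency check.
# Assumes answers[lang][qid] holds the model's answer per language and
# question, and gold[qid] the reference answer; these names are
# illustrative, not taken from the paper's released code.

LANGUAGES = ["en", "ja", "zh", "fr", "de"]  # KnowRecall covers 15 languages; subset shown


def accuracy_per_language(answers: dict, gold: dict) -> dict:
    """Per-language accuracy: how often the answer matches the reference."""
    acc = {}
    for lang in LANGUAGES:
        correct = sum(answers[lang][q] == gold[q] for q in gold)
        acc[lang] = correct / len(gold)
    return acc


def cross_lingual_consistency(answers: dict, gold: dict) -> float:
    """Fraction of questions answered correctly in *every* language."""
    consistent = sum(
        all(answers[lang][q] == gold[q] for lang in LANGUAGES) for q in gold
    )
    return consistent / len(gold)
```

Under a metric like this, a model can score well on per-language accuracy yet poorly on consistency if its knowledge does not transfer across languages, which is exactly the gap the benchmarks are designed to expose.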

Takeaways, Limitations

Takeaways: The paper highlights cross-lingual consistency issues in multimodal large language models and proposes new benchmarks (KnowRecall and VisRecall) to evaluate them, suggesting future research directions. By clearly demonstrating the limitations of state-of-the-art MLLMs, it underscores the importance of developing truly multilingual and culturally aware models.
Limitations: The benchmarks focus on a single domain (world landmarks), so the results may not generalize to other domains or question types. Language coverage is also limited (15 languages for KnowRecall, nine for VisRecall). Furthermore, limited access to details of proprietary models constrains how thoroughly their performance can be analyzed.