Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright in each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning

Created by
  • Haebom

Authors

Jaedong Hwang, Kumar Tanmay, Seok-Jin Lee, Ayush Agrawal, Hamid Palangi, Kumar Ayush, Ila Fiete, Paul Pu Liang

Outline

Large language models (LLMs) excel at tasks such as mathematics, factual question answering, and code generation, but their ability to perform these tasks across multiple languages remains underdeveloped. In low-resource languages such as Swahili or Thai in particular, LLMs often misinterpret prompts or default to reasoning in English. This implicit bias toward high-resource languages hinders factual accuracy, interpretability, and reliability. This paper proposes M2A, a novel method that combines multi-scale multilingual alignment with a language-consistency reward on machine-translated questions, training models to reason directly and accurately in the target language. Furthermore, existing multilingual benchmarks evaluate only final answers, overlooking whether the reasoning itself occurs in the intended language. To address this gap, the paper introduces GeoFact-X, a geography-based multilingual factual reasoning benchmark with reasoning traces in English, Hindi, Japanese, Swahili, and Thai. M2A significantly improves multilingual reasoning fidelity on both mathematical and factual reasoning tasks, highlighting the importance of reasoning-aware multilingual reinforcement learning for robust cross-lingual generalization.
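The outline above attributes M2A's gains to a language-consistency reward used during reinforcement learning. As a rough illustration only (not the authors' implementation), the sketch below shows one way such a reward could be computed: it combines answer correctness with a language-ID check on the reasoning trace. The use of the `langdetect` package, the exact-match correctness check, and the 0.5 weighting are all assumptions made for this example; the paper's actual reward design may differ.

```python
# Minimal sketch of a language-consistency reward of the kind described
# in the outline: reward the model both for a correct final answer and
# for producing its reasoning trace in the target language.
# NOTE: langdetect is a stand-in language-ID component chosen for this
# illustration; the weighting below is a hypothetical choice.

from langdetect import detect, LangDetectException

def language_consistency_reward(
    reasoning_trace: str,
    final_answer: str,
    gold_answer: str,
    target_lang: str,        # ISO 639-1 code, e.g. "sw" (Swahili), "th" (Thai)
    lang_weight: float = 0.5,  # hypothetical trade-off between the two terms
) -> float:
    """Combine answer correctness with a bonus for reasoning in target_lang."""
    # Correctness term: 1 if the final answer matches the reference, else 0.
    correctness = float(final_answer.strip() == gold_answer.strip())

    # Consistency term: 1 if the detected language of the reasoning trace
    # matches the target language, else 0.
    try:
        consistency = float(detect(reasoning_trace) == target_lang)
    except LangDetectException:  # empty or undecidable text
        consistency = 0.0

    return (1.0 - lang_weight) * correctness + lang_weight * consistency
```

In a PPO- or GRPO-style training loop, a scalar like this would score each sampled response, pushing the policy toward answers that are both correct and reasoned in the prompt's language rather than silently switching to English.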

Takeaways, Limitations

Takeaways:
  • The M2A methodology improves multilingual reasoning capabilities.
  • The GeoFact-X benchmark sets a new standard for evaluating multilingual reasoning.
  • The results underscore the importance of reasoning-aware multilingual reinforcement learning.
  • The approach suggests a path to improving LLM performance on low-resource languages.

Limitations:
  • Further research is needed to determine how well the proposed methodology generalizes to other languages and tasks.
  • The benchmark currently covers only five languages, so evaluation on more languages is needed.