Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs

Created by
  • Haebom

Authors

Ammar Khairi, Daniel D'souza, Ye Shen, Julia Kreutzer, Sara Hooker

Outline

In this paper, we study how to efficiently scale inference-time compute for open-ended generative tasks in multilingual, multi-task settings. Whereas prior work has focused on English and a few easily verifiable domains such as math and code, this study targets techniques that generalize across open-ended tasks, formally verifiable tasks, and languages. We show that both temperature-based sampling strategies and selection strategies must be adapted to different domains and language settings, and we find that selection methods effective in English often fail to generalize to other languages. We therefore propose novel sampling and selection strategies tailored to multilingual, multi-task inference scenarios. The proposed methods achieve significant gains across a variety of languages and tasks: they yield an average +6.8 win-rate improvement for an 8B model on m-ArenaHard-v2.0 prompts, and a +9.0 win-rate improvement for the 111B model Command-A on the same benchmark with only five samples compared to single-sample decoding. These results highlight the need for language- and task-aware approaches to inference-time compute in order to improve performance in low-resource languages.
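To make the sample-then-select idea behind this kind of inference-time scaling concrete, below is a minimal, hypothetical Python sketch: draw several completions at varied temperatures and keep the highest-scoring one. The names `generate_fn` and `score_fn` are placeholder callables (a model sampling call and a quality scorer such as a reward model or LLM judge), not the paper's actual sampling and selection strategies, which are more involved and language/task-aware.

```python
# Minimal sketch of parallel sampling with varied temperatures followed by
# best-of-n selection. `generate_fn` and `score_fn` are hypothetical stand-ins,
# not APIs from the paper or any specific library.
import random
from typing import Callable, List, Tuple


def sample_then_select(
    prompt: str,
    generate_fn: Callable[[str, float], str],   # returns one completion at a given temperature
    score_fn: Callable[[str, str], float],      # scores a (prompt, completion) pair
    num_samples: int = 5,
    temperature_range: Tuple[float, float] = (0.3, 1.0),
) -> str:
    """Draw num_samples completions at varied temperatures and return the best-scoring one."""
    candidates: List[str] = []
    for _ in range(num_samples):
        # Vary the temperature per sample to diversify the candidate pool.
        temperature = random.uniform(*temperature_range)
        candidates.append(generate_fn(prompt, temperature))
    # Select the candidate the scorer ranks highest.
    return max(candidates, key=lambda c: score_fn(prompt, c))
```

In this framing, the paper's contribution corresponds to choosing the temperature schedule and the selection rule so that they hold up across languages and task types, rather than only on English math/code benchmarks.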

Takeaways, Limitations

Takeaways:
  • Presents novel sampling and selection strategies for efficiently scaling inference-time compute in multilingual, multi-task settings.
  • Experimentally demonstrates that the proposed methods achieve significant performance improvements across a variety of languages and tasks.
  • Emphasizes the importance of language- and task-aware approaches for improving performance in low-resource languages.
  • Shows that substantial gains can be achieved at minimal cost (e.g., a +9.0 win-rate improvement using only 5 samples).
Limitations:
  • Further research is needed on the generalization performance of the proposed methods.
  • Since the results are for specific benchmarks and models, generalizability to other benchmarks and models needs to be verified.
  • Limited detail is provided about the proprietary models used for comparison (e.g., Gemini).