This paper highlights the shortcomings of existing methods for evaluating the generative performance of multilingual large language models (mLLMs) and proposes ways to improve mLLM evaluation, drawing on successful practices from the field of machine translation (MT) evaluation. We show that current mLLM evaluation lacks systematic and rigorous criteria for assessing generative performance and is applied inconsistently across research groups. Through experiments, we demonstrate that applying transparent reporting criteria and reliable evaluation methods from MT evaluation to mLLM evaluation can improve understanding of quality differences between models. Furthermore, we present the essential components of a robust meta-evaluation, which rigorously assesses the evaluation methods themselves, and provide a checklist of practical recommendations for mLLM research and development.