This paper addresses shortcomings in evaluation methodology and the resulting inconsistency of password guessing research based on generative models. To this end, we present MAYA, an integrated and customizable benchmarking framework. Using MAYA, we comprehensively evaluate six state-of-the-art generative password guessing models on eight real-world password datasets, spending over 15,000 computing hours. The results show that generative models effectively capture various aspects of human password distributions and exhibit strong generalization capabilities, although their effectiveness on long and complex passwords varies significantly across models. In particular, sequential models outperform other generative architectures as well as existing password guessing tools, and multi-model attacks that combine models trained on diverse password distributions outperform individual models. MAYA is publicly available and is intended to enable continuous and reliable benchmarking of generative password guessing models.