This paper highlights that state-of-the-art large language models (LLMs) rely on internal randomness when generating responses and may therefore respond differently when given the same prompt multiple times. To control for this randomness in LLM evaluation and ranking, we develop a causal model for coupled autoregressive generation, which allows different LLMs to generate responses using the same source of randomness. On evaluations based on benchmark datasets, coupled autoregressive generation reaches the same conclusions as conventional autoregressive generation while requiring significantly fewer samples. However, on evaluations based on pairwise comparisons, coupled and conventional autoregressive generation can, surprisingly, yield different rankings when two or more models are compared. This suggests that a model's apparent superiority under existing evaluation protocols may be confounded by the randomness of the generation process rather than reflect its actual performance. Experiments with multiple LLMs from the Llama, Mistral, and Qwen families show that coupled autoregressive generation reaches the same conclusions as conventional autoregressive generation with up to 75% fewer samples, and that pairwise win rates on prompts from the LMSYS Chatbot Arena platform differ between coupled and conventional autoregressive generation.
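
As a rough illustration of the coupling idea summarized above, the sketch below shares a single uniform random draw per generation step across two toy "models" and maps it to each model's next token via inverse-CDF sampling. This is only one simple way to give several models the same source of randomness, not necessarily the paper's exact construction; the vocabulary and the model_a_probs / model_b_probs functions are hypothetical placeholders rather than real LLMs.

```python
# A minimal sketch of coupled autoregressive generation, assuming each model
# exposes a next-token distribution over a shared vocabulary. The toy VOCAB and
# the model_a_probs / model_b_probs functions are hypothetical stand-ins for
# real LLMs; the coupling shown here (inverse-CDF sampling driven by one shared
# uniform draw per step) is one simple realization of shared randomness, not
# necessarily the causal model used in the paper.
import numpy as np

VOCAB = ["yes", "no", "maybe", "<eos>"]

def coupled_generate(next_token_probs_fns, rng, max_steps=10):
    """Generate one response per model, reusing the same random draw at every step."""
    responses = [[] for _ in next_token_probs_fns]
    finished = [False] * len(next_token_probs_fns)
    for _ in range(max_steps):
        u = rng.random()  # shared randomness for this step, consumed by every model
        for i, probs_fn in enumerate(next_token_probs_fns):
            if finished[i]:
                continue
            probs = probs_fn(responses[i])  # model-specific next-token distribution
            # Inverse-CDF sampling: the same u is mapped through each model's own CDF.
            token_id = min(int(np.searchsorted(np.cumsum(probs), u)), len(VOCAB) - 1)
            token = VOCAB[token_id]
            responses[i].append(token)
            finished[i] = token == "<eos>"
        if all(finished):
            break
    return [" ".join(r) for r in responses]

# Hypothetical next-token distributions standing in for two different LLMs.
def model_a_probs(context):
    return np.array([0.5, 0.3, 0.1, 0.1])

def model_b_probs(context):
    return np.array([0.4, 0.4, 0.1, 0.1])

rng = np.random.default_rng(0)
print(coupled_generate([model_a_probs, model_b_probs], rng))
```

Because both models consume the same uniform draw at each step, any difference between their outputs reflects differences in their next-token distributions rather than in the sampling noise, which is what makes paired, lower-variance comparisons possible.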