This paper presents a novel approach to generating and solving complex, PhD-level probability questions by leveraging multiple large-scale language models, including GPT-4, Meta-LLAMA, Claude, and Gemini. Instead of relying on traditional, correct-answer-based evaluation, we assess the reliability of answers and the quality of questions based on the degree of agreement among the models. We analyze inter-model agreement and accuracy using statistical evaluations such as the chi-square test, Fleiss' Kappa coefficient, and confidence interval calculations. Our analysis reveals that Claude and Gemini tend to generate clearer, less ambiguous questions, while LLAMA produces less consistent ones. These findings suggest that a multi-model collaboration strategy is effective for enhancing answer reliability and for assessing and improving question quality even when no ground-truth answer is available. This study provides actionable insights into improving AI-based inference processes through coordinated interactions among heterogeneous language models.
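As a minimal sketch of the agreement-based evaluation idea described above, the following Python snippet computes Fleiss' Kappa over categorical answers from several models. The data, function name, and answer categories are illustrative assumptions for exposition only and are not taken from the paper's experiments.

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Compute Fleiss' Kappa from an (N_items x N_categories) count matrix,
    where entry [i, j] is the number of raters (models) that assigned item i
    to category j. Assumes every item was rated by the same number of raters."""
    n_items, _ = ratings.shape
    n_raters = ratings.sum(axis=1)[0]  # raters (models) per item, constant by assumption

    # Per-item agreement: proportion of agreeing rater pairs.
    p_i = (np.square(ratings).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()

    # Expected agreement by chance, from the overall category proportions.
    p_j = ratings.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()

    return (p_bar - p_e) / (1 - p_e)

# Illustrative (hypothetical) example: 4 models answer 5 multiple-choice
# questions; each row counts how many models chose options A, B, or C.
counts = np.array([
    [4, 0, 0],  # unanimous agreement
    [3, 1, 0],
    [2, 2, 0],
    [0, 4, 0],  # unanimous agreement
    [1, 2, 1],  # high disagreement
])
print(f"Fleiss' Kappa: {fleiss_kappa(counts):.3f}")
```

A Kappa near 1 indicates strong inter-model agreement, while values near 0 indicate agreement no better than chance; under the approach summarized above, such agreement scores can serve as a proxy for answer reliability and question clarity when no ground-truth answer is available.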