Reasoning language models, which perform multi-step reasoning learned via reinforcement learning, have achieved state-of-the-art performance on many benchmarks. However, like traditional language models, they suffer from hallucinations: they confidently give incorrect answers. Understanding the reliability of these models is essential for safely deploying them in real-world applications. In this paper, we therefore study uncertainty quantification for reasoning models and address three questions: whether reasoning models are calibrated, how deeper reasoning affects calibration, and whether calibration can be improved by having the model explicitly reason about its own reasoning process. To this end, we introduce introspective uncertainty quantification (UQ) and evaluate state-of-the-art reasoning models on a range of benchmarks. Our experiments show that reasoning models are generally overconfident, with self-verbalized confidence estimates for incorrect answers often exceeding 85%; that deeper reasoning exacerbates this overconfidence; and that introspection can sometimes improve calibration. Finally, we identify essential UQ benchmarks and suggest important research directions for improving the calibration of reasoning models.
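To make the idea of introspective UQ concrete, the sketch below shows one way a model could be prompted to re-examine its own reasoning trace before verbalizing a confidence estimate. The prompt wording, the `query_model` callable, and the `introspective_confidence` helper are illustrative assumptions for this sketch, not the protocol used in the paper.

```python
# Minimal sketch of introspective uncertainty quantification (UQ):
# the model first answers with its reasoning, then is asked to reflect
# on that reasoning and verbalize a confidence. All names and prompt
# wording here are hypothetical, assumed for illustration only.
import re
from typing import Callable, Tuple


def introspective_confidence(question: str,
                             query_model: Callable[[str], str]) -> Tuple[str, float]:
    """Return (answer, self-verbalized confidence in [0, 1])."""
    # Step 1: obtain the answer together with the model's reasoning trace.
    answer = query_model(
        f"Question: {question}\n"
        "Think step by step, then give your final answer."
    )

    # Step 2 (introspection): ask the model to critique its own reasoning
    # before stating a confidence, rather than asking for a confidence directly.
    reflection = query_model(
        "Here is a question along with your earlier reasoning and answer.\n"
        f"Question: {question}\n"
        f"Your response: {answer}\n"
        "Re-examine each step of your reasoning for possible errors, then state "
        "how confident you are that the final answer is correct as a percentage, "
        "formatted as 'Confidence: X%'."
    )

    # Parse the verbalized confidence; NaN if the model did not comply.
    match = re.search(r"Confidence:\s*(\d+(?:\.\d+)?)\s*%", reflection)
    confidence = float(match.group(1)) / 100.0 if match else float("nan")
    return answer, confidence
```

Comparing such self-verbalized confidences against answer correctness (e.g., via expected calibration error) is the kind of evaluation the abstract describes when reporting that confidence on incorrect answers often exceeds 85%.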