This paper introduces a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes training of the variational posterior. We further show that binary-reward RL, including rejection-sampling fine-tuning and GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy arises naturally in the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, this work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and offers stable objectives for improving the reasoning ability of language models.
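For reference, a minimal sketch of the single-trace ELBO that the multi-trace and forward-KL objectives build on, written in assumed notation not fixed by the abstract: $x$ denotes the question, $y$ the final answer, $z$ the latent thinking trace, $p_\theta$ the language model, and $q_\phi$ the variational posterior over traces:
\[
\log p_\theta(y \mid x)
\;\ge\;
\mathbb{E}_{z \sim q_\phi(z \mid x,\, y)}\!\bigl[\log p_\theta(y \mid x,\, z)\bigr]
\;-\;
\mathrm{KL}\!\bigl(q_\phi(z \mid x,\, y)\,\big\|\, p_\theta(z \mid x)\bigr).
\]
The reverse KL above is what maximizing the standard ELBO minimizes with respect to $q_\phi$; a natural reading of the forward-KL formulation is that it instead targets the opposite direction, $\mathrm{KL}\bigl(p_\theta(z \mid x, y)\,\|\,q_\phi(z \mid x, y)\bigr)$, which is mass-covering rather than mode-seeking and can be estimated from sampled traces that reach the correct answer, consistent with the connection drawn to rejection-sampling fine-tuning and GRPO.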