This paper presents a methodology for aligning large language models (LLMs) with human values. Specifically, we interpret two representative alignment methods, Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), from the perspective of mutual information (MI) maximization. This interpretation reveals their connection to contrastive learning and shows that both implicitly rely on the Donsker-Varadhan (DV) lower bound, the bound underlying the MINE estimator. Building on this analysis, we propose Mutual Information Optimization (MIO), which replaces the DV/MINE lower bound with the Jensen-Shannon (JS) MI estimator. Through theoretical analysis and experiments, we show that MIO mitigates the late-stage performance degradation observed in DPO while achieving competitive results on a range of reasoning and mathematical benchmarks.
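For reference, the two MI estimators contrasted above can be written in their standard forms from the MI-estimation literature; the critic $T_\theta$ below is a generic placeholder, and the paper's exact parameterization (e.g., in terms of the policy and reference model) may differ:
\[
I(X;Y) \;\ge\; \mathbb{E}_{p(x,y)}\big[T_\theta(x,y)\big] \;-\; \log \mathbb{E}_{p(x)\,p(y)}\big[e^{T_\theta(x,y)}\big] \qquad \text{(DV/MINE lower bound)},
\]
\[
\widehat{I}_{\mathrm{JS}}(X;Y) \;=\; \mathbb{E}_{p(x,y)}\big[-\operatorname{sp}(-T_\theta(x,y))\big] \;-\; \mathbb{E}_{p(x)\,p(y)}\big[\operatorname{sp}(T_\theta(x,y))\big] \qquad \text{(JS estimator)},
\]
where $\operatorname{sp}(z) = \log(1 + e^{z})$ denotes the softplus function. The bounded softplus terms of the JS estimator, in contrast to the exponential term in the DV bound, are what motivate its use in MIO.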