In this paper, we propose a novel adaptive preference optimization algorithm, α-DPO, for aligning large language models (LLMs) with human values and intentions while avoiding the computational inefficiency and training instability of reinforcement learning from human feedback (RLHF). α-DPO introduces a dynamic reward margin that reduces dependence on an optimal reference model and mitigates suboptimal decisions under diverse data settings. It achieves personalized reward margins by constructing an adaptive preference distribution that balances the policy model against the reference model. Through theoretical guarantees and empirical evaluations on AlpacaEval 2 and Arena-Hard, we show that α-DPO consistently outperforms DPO and SimPO, establishing it as a powerful tool for LLM alignment.
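
As a rough illustration of the idea (a sketch, not the authors' implementation), the snippet below shows a SimPO-style pairwise preference loss in which the fixed target margin is replaced by a per-example margin driven by the disagreement between the policy and the reference model. The function name `alpha_dpo_loss`, the tensor arguments, and the hyperparameters `beta`, `gamma`, and `alpha` are assumed for illustration only.

```python
# Illustrative sketch of an adaptive-margin preference loss (hypothetical names,
# not the paper's released code).
import torch
import torch.nn.functional as F

def alpha_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps,
                   beta=2.0, gamma=0.5, alpha=0.1):
    """Pairwise preference loss with a per-example (adaptive) reward margin.

    All *_logps are length-normalized sequence log-probabilities, shape (batch,).
    """
    # Implicit reward difference under the policy model (SimPO-style term).
    policy_diff = beta * (policy_chosen_logps - policy_rejected_logps)

    # Disagreement between the policy's and the reference model's preference
    # gaps, used as the adaptive part of the margin (no gradient through it).
    with torch.no_grad():
        margin_signal = (policy_chosen_logps - policy_rejected_logps) \
                      - (ref_chosen_logps - ref_rejected_logps)

    adaptive_margin = gamma + alpha * margin_signal  # per-example target margin
    return -F.logsigmoid(policy_diff - adaptive_margin).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
if __name__ == "__main__":
    b = 4
    loss = alpha_dpo_loss(torch.randn(b), torch.randn(b),
                          torch.randn(b), torch.randn(b))
    print(loss.item())
```

Setting `alpha = 0` recovers a fixed-margin objective; a positive `alpha` lets the required margin adapt to how strongly the policy and reference models disagree on each preference pair, which is the intuition behind the personalized reward margins described above.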