By analyzing implementations of the KL divergence loss in RLHF, we propose a unified framework that bridges the two implementation styles, "k_n as reward" and "k_n as loss." This framework clarifies the principle behind Reverse KL (RKL) regularization and proves that "k_2 as loss" is gradient-equivalent to "k_1 as reward" under on-policy conditions. Furthermore, we show that "k_3 as loss" is a biased approximation and propose a method to correct the bias that can arise in off-policy implementations.
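
To make the estimators and the equivalence claim concrete, here is a minimal PyTorch sketch, not taken from the paper: the tensor names `logp_policy` and `logp_ref` and the toy discrete distribution are illustrative assumptions. It computes the standard k_1, k_2, k_3 estimators of the reverse KL, KL(pi_theta || pi_ref), and numerically checks that, under on-policy sampling, the gradient of the "k_2 as loss" objective matches the score-function (policy-gradient) update obtained by treating k_1 as a detached reward penalty.

```python
import torch

def kl_estimators(logp_policy: torch.Tensor, logp_ref: torch.Tensor):
    """Per-sample estimators of the reverse KL, KL(pi_theta || pi_ref),
    evaluated on samples drawn from pi_theta.

    logp_policy: log pi_theta(x)  (carries gradients)
    logp_ref:    log pi_ref(x)    (treated as a constant)
    """
    log_ratio = logp_ref - logp_policy            # log r, with r = pi_ref / pi_theta
    k1 = -log_ratio                               # log pi_theta - log pi_ref
    k2 = 0.5 * log_ratio ** 2
    k3 = log_ratio.exp() - 1.0 - log_ratio        # always non-negative
    return k1, k2, k3

# Toy on-policy check: on a small discrete support we can take the exact
# expectation under pi_theta (a stand-in for Monte Carlo sampling).
torch.manual_seed(0)
logits = torch.randn(5, requires_grad=True)
logp_policy = torch.log_softmax(logits, dim=-1)
logp_ref = torch.log_softmax(torch.randn(5), dim=-1)

probs = logp_policy.exp().detach()                # sampling distribution: stop-gradient
k1, k2, k3 = kl_estimators(logp_policy, logp_ref)

# "k_2 as loss": differentiate k2 directly under the (detached) sampling distribution.
loss_k2_as_loss = (probs * k2).sum()

# "k_1 as reward": k1 enters as a detached per-sample penalty, and the update
# comes from the score-function (REINFORCE) surrogate.
loss_k1_as_reward = (probs * k1.detach() * logp_policy).sum()

g_k2 = torch.autograd.grad(loss_k2_as_loss, logits, retain_graph=True)[0]
g_k1 = torch.autograd.grad(loss_k1_as_reward, logits)[0]
print(torch.allclose(g_k2, g_k1, atol=1e-6))      # gradients coincide on-policy: True
```

The same setup also hints at why "k_3 as loss" deviates: differentiating k_3 weights the score function by (1 - r) with r = pi_ref / pi_theta, rather than by log(pi_theta / pi_ref) as the exact RKL gradient does, which is the bias discussed above.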