Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Murphys Laws of AI Alignment: Why the Gap Always Wins

Created by
  • Haebom

Author

Madhava Gaikwad

Outline

This paper presents a formal impossibility result for Reinforcement Learning from Human Feedback (RLHF). In a misspecified environment with a limited query budget, an RLHF-style learner without access to a correction oracle suffers an unavoidable performance gap of Ω(γ). Information-theoretic proofs establish strict lower bounds and show that a minimal number of correction-oracle queries suffices to close the gap. A small empirical example and a catalogue of alignment "Murphy's laws" illustrate that many observed alignment failures are consistent with this structural mechanism. These findings establish Murphy's gap as a diagnostic limitation of RLHF and point toward future research on correction and causal preference identification.
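The sketch below is a minimal, hypothetical Python illustration of the mechanism summarized above, not code from the paper. It sets up a toy two-action problem in which the proxy reward inferred from feedback is misspecified by a margin GAMMA, so optimizing the proxy alone incurs per-step regret on the order of γ, while a single query to a correction oracle (modeled here simply as access to the true ranking of the top two candidates) removes the gap. The names GAMMA, TRUE_REWARD, PROXY_REWARD, and rlhf_style_learner are illustrative assumptions only.

```python
# Hypothetical illustration (not the paper's construction): two actions where the
# proxy reward inferred from feedback is misspecified, flipping the true ranking.
GAMMA = 0.3  # misspecification margin; the paper's lower bound scales as Omega(gamma)

TRUE_REWARD = {"a": 1.0, "b": 1.0 - GAMMA}               # action "a" is truly better
PROXY_REWARD = {"a": 1.0 - 2 * GAMMA, "b": 1.0 - GAMMA}  # proxy prefers "b" instead


def rlhf_style_learner(use_correction_oracle: bool) -> str:
    """Pick the action that looks best under the proxy reward; optionally spend
    one correction query (modeled as access to the true ranking of the top two
    candidates) to fix a misspecified comparison."""
    ranked = sorted(PROXY_REWARD, key=PROXY_REWARD.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if use_correction_oracle and TRUE_REWARD[runner_up] > TRUE_REWARD[best]:
        best = runner_up  # a single correction closes the gap in this toy case
    return best


def regret(action: str) -> float:
    """Per-step regret under the true reward."""
    return max(TRUE_REWARD.values()) - TRUE_REWARD[action]


print("regret without correction oracle:", regret(rlhf_style_learner(False)))  # ~GAMMA
print("regret with correction oracle:   ", regret(rlhf_style_learner(True)))   # 0.0
```

Running the sketch prints a regret of roughly GAMMA without the oracle and 0.0 with it, mirroring in miniature the Ω(γ) lower bound and the claim that a small number of corrections suffices.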

Takeaways, Limitations

Takeaways: We identify "Murphy's gap," a structural limitation of RLHF, and highlight the importance of a correction oracle for closing it. We give information-theoretic proofs of RLHF's performance limits in misspecified environments, which suggest directions for future RLHF research, and we offer a new explanation for many observed alignment failures.
Limitations: Only a small empirical example is presented, so further work is needed to establish applicability and generalizability to large-scale systems. The paper offers little concrete discussion of how a correction oracle would be implemented or applied in practice, and the definition and scope of the "Murphy's laws" remain loosely specified and need further elaboration.