Reliable safety planning for advanced AI development requires methods to verify agent behavior and to detect potential control failures early; a central challenge is ensuring that agents adhere to safety-critical principles when those principles conflict with operational goals. To this end, we present a lightweight, interpretable benchmark that evaluates the ability of LLM agents to follow high-level safety principles when faced with conflicting task instructions. An evaluation of six LLMs yields two main findings: (1) compliance costs (safety constraints degrade task performance even when a compliant solution exists) and (2) the illusion of compliance (high compliance often masks task incompetence rather than principled choice). These results provide initial evidence that, while LLMs can be influenced by hierarchical instructions, current approaches lack the consistency necessary for reliable safety governance.
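To make the two reported effects concrete, the sketch below shows one way they could be operationalized from per-episode records of task success and compliance; it is an illustrative assumption, not the paper's actual metric code, and all names (`EpisodeResult`, `compliance_cost`, `illusion_of_compliance`) are hypothetical.

```python
# Minimal sketch, assuming each benchmark run yields a record of whether the
# agent completed the task, whether it respected the safety principle, and
# whether the principle was in force. Names and fields are illustrative.
from dataclasses import dataclass
from typing import List


@dataclass
class EpisodeResult:
    completed_task: bool   # did the agent achieve the operational goal?
    complied: bool         # did the agent respect the safety principle?
    constrained: bool      # was the safety principle in force for this run?


def task_success_rate(episodes: List[EpisodeResult], constrained: bool) -> float:
    subset = [e for e in episodes if e.constrained == constrained]
    return sum(e.completed_task for e in subset) / max(len(subset), 1)


def compliance_cost(episodes: List[EpisodeResult]) -> float:
    """Drop in task success attributable to the safety constraint,
    measured on scenarios where a compliant solution exists."""
    return (task_success_rate(episodes, constrained=False)
            - task_success_rate(episodes, constrained=True))


def illusion_of_compliance(episodes: List[EpisodeResult]) -> float:
    """Fraction of compliant constrained episodes in which the agent also
    failed the task, i.e. apparent compliance that may reflect task
    incompetence rather than a principled choice."""
    compliant = [e for e in episodes if e.constrained and e.complied]
    return sum(not e.completed_task for e in compliant) / max(len(compliant), 1)
```

Under this reading, a high `illusion_of_compliance` score would indicate that much of the observed compliance co-occurs with task failure, which is why compliance rates alone can overstate principled behavior.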