Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Control Illusion: The Failure of Instruction Hierarchies in Large Language Models

Created by
  • Haebom

Author

Yilin Geng, Haonan Li, Honglin Mu, Xudong Han, Timothy Baldwin, Omri Abend, Eduard Hovy, Lea Frermann

Outline

This paper addresses the lack of a systematic understanding of effective control mechanisms for large-scale language models (LLMs) that employ hierarchical instruction hierarchies (e.g., system-level instructions take precedence over user messages). We present a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies. Experiments on six state-of-the-art LLMs reveal that the models struggle to consistently enforce instruction prioritization even in simple formal conflicts. The widely used system/user prompt separation fails to establish a reliable instruction hierarchy, and the models exhibit strong inherent biases toward certain constraint types, regardless of prioritization. LLMs tend to more reliably follow constraints constructed through natural social hierarchies (e.g., authority, expertise, consensus) than system/user roles, suggesting that pre-trained social structures can act as potential control priors, exerting a stronger influence than post-training safeguards.

Takeaways, Limitations

Takeaways: We presented a systematic evaluation framework for LLM's ability to process hierarchical instructions. We found that LLM follows constraints based on social hierarchy better than constraints based on system/user roles, demonstrating the influence of the social structure inherent in LLM's pre-training data. This provides important insights for improving LLM design and control strategies.
Limitations: This study is based on experimental results limited to a specific LLM and constraint type, so further research is needed to determine its generalizability. More complex and diverse instruction hierarchies and contexts are needed, and further analysis is needed to determine the influence of factors other than social hierarchy.
👍