Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
It is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Automating Steering for Safe Multimodal Large Language Models

Created by
  • Haebom

Author

Lyucheng Wu, Mengru Wang, Ziwen Xu, Tri Cao, Nay Oo, Bryan Hooi, Shumin Deng

Outline

This paper proposes AutoSteer, an inference-time intervention technique for improving the safety of multimodal large language models (MLLMs) without fine-tuning the underlying model. AutoSteer consists of three core components: a Safety Awareness Score (SAS), an adaptive safety prober, and a lightweight refusal head. The SAS automatically identifies the most safety-relevant distinctions among the model's internal layers, the adaptive safety prober estimates the likelihood of harmful outputs from intermediate representations, and the refusal head selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon show that AutoSteer significantly reduces the attack success rate (ASR) against textual, visual, and cross-modal threats while preserving general capabilities. These results position AutoSteer as a practical, interpretable, and effective framework for the safer deployment of multimodal AI systems.
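To make the gating pipeline concrete, below is a minimal Python sketch of this kind of inference-time safety steering, assuming a HuggingFace-style model that exposes hidden states. The probe architecture, the layer index (assumed to be chosen offline via the SAS), the threshold, and the hard-refusal fallback are illustrative assumptions, not the authors' implementation; in the paper the refusal head modulates generation rather than returning a fixed string.

```python
# Minimal sketch of AutoSteer-style inference-time safety steering.
# All names (SafetyProber, steered_generate, probe_layer, threshold, refusal_text)
# are illustrative assumptions, not the authors' actual API.
import torch
import torch.nn as nn


class SafetyProber(nn.Module):
    """Lightweight probe mapping a pooled hidden state to a harm probability."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_dim) pooled representation from the
        # SAS-selected layer; returns probability of a harmful continuation.
        return torch.sigmoid(self.classifier(hidden_state)).squeeze(-1)


def steered_generate(model, tokenizer, inputs, prober, probe_layer: int,
                     threshold: float = 0.5,
                     refusal_text: str = "I can't help with that."):
    """One safety-gated generation call.

    1. Forward the (multimodal) inputs with hidden states exposed.
    2. Score the pooled hidden state of the SAS-selected layer with the prober.
    3. If the estimated risk exceeds the threshold, emit a refusal instead of
       the model's normal continuation; otherwise generate as usual.
    """
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
        pooled = out.hidden_states[probe_layer].mean(dim=1)  # (batch, hidden_dim)
        risk = prober(pooled)

    if risk.item() > threshold:
        # Hard refusal approximates the refusal head's steering effect.
        return refusal_text
    generated = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```

Keeping the prober and refusal logic outside the base model is what lets this kind of approach add a safety gate at inference time without any fine-tuning of the underlying MLLM.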

Takeaways, Limitations

Takeaways:
Presents an effective inference-time intervention technique that addresses the safety issues of existing MLLMs.
Improves safety without fine-tuning the underlying model.
Demonstrates reduced attack success rates against textual, visual, and cross-modal threats.
Provides an interpretable and practical safety framework.
Limitations:
Experimental results are reported only for specific MLLMs (LLaVA-OV, Chameleon) and safety benchmarks, so generalizability to other models and settings requires further study.
Further work is needed to improve and optimize the SAS, the adaptive safety prober, and the refusal head.
Applicability and safety in real-world deployment environments require further verification.