Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

Created by
  • Haebom

Author

Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, Jing Shao

Outline

Flawed planning in materialized agents based on large-scale language models (VLMs) poses serious safety risks that hinder deployment in real-world scenarios. Existing static, non-interactive assessment paradigms fail to adequately assess risks within these interactive environments because they cannot simulate the dynamic risks arising from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To address this critical gap, this paper proposes a method to assess an agent's interaction safety—its ability to recognize emerging hazards and execute mitigation steps in the correct procedural sequence. Therefore, we present IS-Bench, the first multimodal interaction safety benchmark featuring 161 challenging scenarios involving 388 unique safety hazards implemented in a high-fidelity simulator. Crucially, it facilitates a novel process-centric assessment that determines whether risk mitigation actions are implemented before or after a specific risky step. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, demonstrate that current agents lack interaction safety awareness, and while safety-aware chains of thought can improve performance, they often hinder task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable object-oriented AI systems. The code and data are available at this link .

Takeaways, Limitations

Takeaways:
Presentation of IS-Bench, a new benchmark for safety risk assessment in interactive environments.
Proposing a process-oriented evaluation method for interaction safety assessment.
Presentation of experimental analysis results on the interaction safety level of major VLMs.
Providing a foundation for developing safer and more reliable object-oriented AI systems.
Ensuring reproducibility and scalability of research through open code and data.
Limitations:
Currently, IS-Bench is evaluated in a high-fidelity simulator environment, so further research is needed to determine its generalizability to real-world environments.
The application of the Chain of Thought (CHI) approach to safety awareness has been shown to be problematic, potentially leading to lower task completion rates. Research is needed to develop more effective safety enhancement techniques.
Further review is needed of the diversity and generalizability of the scenarios included in the benchmark.
👍