Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition

Created by
  • Haebom

Authors

Dasol Choi, Seunghyun Lee, Youngsook Song

Outline

In this paper, we present a new benchmark dataset, the Visual Emergency Recognition Dataset (VERI), to evaluate the reliability of vision-language models (VLMs) in safety-critical everyday scenarios. VERI contains 200 images arranged as contrastive pairs: each emergency image is matched with a visually similar but safe image. We evaluate 14 VLMs (2B to 124B parameters) using a two-step evaluation protocol (hazard identification and emergency response) covering medical emergencies, accidents, and natural disasters. We find that the models identify true emergencies accurately (70-100% success rate) but also exhibit a high false positive rate, which we call the “overreaction problem”. They misclassify 31-96% of safe scenarios as unsafe, and 10 safe scenarios are misclassified by all models regardless of model size. This “better safe than sorry” bias stems mainly from over-interpretation of the situation (88-93% of errors), which raises concerns about the reliability of VLMs in safety-critical applications. In conclusion, this study highlights the need for strategies that improve situational reasoning in ambiguous visual scenes.
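The two headline numbers above (success rate on true emergencies and false alarm rate on safe scenes) and the set of safe scenarios that every model misclassifies can be computed from per-image predictions. The following is a minimal sketch, not the authors' evaluation code: the `Prediction` record, the model names, the scenario identifiers, and the toy data are all hypothetical and only illustrate how such metrics might be derived from paired emergency/safe images.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """One model judgment on one image (hypothetical record layout)."""
    model: str          # hypothetical model name
    scenario_id: str    # hypothetical image identifier
    is_emergency: bool  # ground-truth label of the image
    flagged: bool       # True if the model judged the scene dangerous


def _fraction(hits, total):
    # Avoid division by zero when a subset is empty.
    return hits / total if total else 0.0


def rates(preds):
    """Return (emergency success rate, false alarm rate) for the predictions."""
    emergencies = [p for p in preds if p.is_emergency]
    safe = [p for p in preds if not p.is_emergency]
    success = _fraction(sum(p.flagged for p in emergencies), len(emergencies))
    false_alarm = _fraction(sum(p.flagged for p in safe), len(safe))
    return success, false_alarm


def universally_misclassified(preds):
    """Return safe scenarios that every model flagged as dangerous."""
    models = {p.model for p in preds}
    flagged_by = {}
    for p in preds:
        if not p.is_emergency and p.flagged:
            flagged_by.setdefault(p.scenario_id, set()).add(p.model)
    return sorted(s for s, who in flagged_by.items() if who == models)


# Toy data: two hypothetical models, two emergency/safe pairs.
preds = [
    Prediction("model_a", "pair1_emergency", True, True),
    Prediction("model_a", "pair1_safe", False, True),
    Prediction("model_a", "pair2_emergency", True, True),
    Prediction("model_a", "pair2_safe", False, False),
    Prediction("model_b", "pair1_emergency", True, True),
    Prediction("model_b", "pair1_safe", False, True),
    Prediction("model_b", "pair2_emergency", True, False),
    Prediction("model_b", "pair2_safe", False, True),
]

for name in ("model_a", "model_b"):
    success, false_alarm = rates([p for p in preds if p.model == name])
    print(f"{name}: success={success:.2f} false_alarm={false_alarm:.2f}")

print("misclassified by all models:", universally_misclassified(preds))
```

In this toy data, `model_a` finds both emergencies but raises one false alarm, while `model_b` misses one emergency and flags both safe scenes. Only `pair1_safe` is flagged by both, which mirrors in miniature the kind of scenario the summary says all 14 models get wrong.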

Takeaways, Limitations

Takeaways:
Highlights serious reliability concerns about VLMs in safety-critical applications.
Identifies the “overreaction problem” of VLMs and their tendency to overinterpret situations.
Emphasizes the importance of improving situational reasoning in ambiguous visual situations.
Demonstrates that the VERI dataset is an effective tool for diagnosing the performance of VLMs.
Limitations:
The VERI dataset is relatively small (200 images).
The set of VLMs used in the evaluation (14 models) may be limited.
The root causes of the “overreaction problem” may not be analyzed in depth.