In this paper, we present a new benchmark dataset, the Visual Emergency Recognition Dataset (VERI), to evaluate the reliability of vision-language models (VLMs) in everyday safety-critical scenarios. VERI contains 200 images (100 contrastive pairs): each genuine emergency scene is matched with a visually similar but safe counterpart. We evaluate 14 VLMs (ranging from 2B to 124B parameters) using a two-stage evaluation protocol, covering risk identification and emergency response, across medical emergencies, accidents, and natural disasters. We find that while the models accurately identify true emergencies (70-100% success rate), they also exhibit high false positive rates, an "overreaction problem": the rate of misclassifying safe scenarios as emergencies ranges from 31% to 96%, and 10 safe scenarios are misclassified by all 14 models, regardless of model scale. This "better-safe-than-sorry" bias stems mainly from contextual overinterpretation (88-93% of errors), raising concerns about the reliability of VLMs in safety-critical applications. These findings highlight the need for strategies that improve contextual reasoning in ambiguous visual situations.
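To make the two-stage protocol concrete, the sketch below shows one way the evaluation loop and the headline metrics (emergency recognition rate and false positive rate on safe counterparts) could be computed over contrastive pairs. The function `query_vlm`, the prompts, and the yes/no answer parsing are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Minimal sketch of the two-stage VERI-style evaluation loop.
# Assumptions (not from the paper): query_vlm(image, prompt) wraps some
# VLM API and returns free-form text; the prompts and answer parsing
# below are simplified placeholders.

from dataclasses import dataclass

@dataclass
class ContrastivePair:
    emergency_image: str   # path to the genuine emergency scene
    safe_image: str        # visually similar but safe counterpart

STAGE1_PROMPT = "Does this image show an emergency requiring intervention? Answer yes or no."
STAGE2_PROMPT = "Describe the appropriate emergency response for this situation."

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a real VLM call (API or local model)."""
    raise NotImplementedError

def says_emergency(answer: str) -> bool:
    # Naive parsing; a real harness would need more robust matching.
    return answer.strip().lower().startswith("yes")

def evaluate(pairs: list[ContrastivePair]) -> dict[str, float]:
    true_pos = false_pos = 0
    for pair in pairs:
        # Stage 1: risk identification on both images of the pair.
        if says_emergency(query_vlm(pair.emergency_image, STAGE1_PROMPT)):
            true_pos += 1
            # Stage 2: emergency response is elicited only when the model
            # flags an emergency; its quality would be judged separately.
            _response = query_vlm(pair.emergency_image, STAGE2_PROMPT)
        if says_emergency(query_vlm(pair.safe_image, STAGE1_PROMPT)):
            false_pos += 1  # safe scene flagged as an emergency: the overreaction case
    n = len(pairs)
    return {
        "emergency_recognition_rate": true_pos / n,  # 70-100% across models in the paper
        "false_positive_rate": false_pos / n,        # 31-96% across models in the paper
    }
```

Under this framing, the overreaction problem is simply a high `false_positive_rate` despite a high `emergency_recognition_rate`, which is why the contrastive pairing of each emergency scene with a safe look-alike is central to the benchmark's design.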