Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios

Created by
  • Haebom

Authors

Yunkai Dang, Mengxi Gao, Yibo Yan, Xin Zou, Yanggan Gu, Jungang Li, Jingyu Wang, Peijie Jiang, Aiwei Liu, Jia Liu, Xuming Hu

Outline

This paper investigates the vulnerability of multimodal large language models (MLLMs) to misleading inputs, specifically the phenomenon of response uncertainty under misleading scenarios. Using nine standard datasets and twelve state-of-the-art open-source MLLMs, we found that a single misleading cue flipped roughly 65% of previously correct answers. To analyze this quantitatively, we proposed a two-stage evaluation pipeline (first validating the model's original response, then measuring the error rate after injecting a misleading instruction) and constructed a Multimodal Uncertainty Benchmark (MUB) by collecting examples with high error rates. Extensive evaluation of twelve open-source and five closed-source models on MUB revealed an average error rate exceeding 86%, with 67.19% for explicit cues and 80.67% for implicit cues. Finally, we fine-tuned the open-source MLLMs on a dataset of 2,000 samples mixing explicit and implicit misleading instructions, which significantly reduced the error rates (to 6.97% for explicit cues and 32.77% for implicit cues).
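To make the two-stage pipeline concrete, the sketch below shows one way the per-model error rate could be computed: the fraction of initially correct answers that flip once a misleading instruction is injected. The function names (query_mllm, is_correct), the sample format, and the prompt template are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the two-stage evaluation pipeline summarized above.
# query_mllm, is_correct, and the sample/template formats are hypothetical.

def query_mllm(image, prompt):
    """Placeholder for a call to an MLLM (API or local model)."""
    raise NotImplementedError

def is_correct(answer, ground_truth):
    # Naive exact-match check; the paper's matching protocol may differ.
    return answer.strip().lower() == ground_truth.strip().lower()

def misleading_error_rate(samples, misleading_template):
    """samples: iterable of dicts with 'image', 'question', 'answer' keys."""
    flipped = initially_correct = 0
    for s in samples:
        # Stage 1: validate the model's original response.
        original = query_mllm(s["image"], s["question"])
        if not is_correct(original, s["answer"]):
            continue  # only initially correct answers enter stage 2
        initially_correct += 1
        # Stage 2: inject a misleading instruction (explicit or implicit)
        # and check whether the previously correct answer flips.
        misled_prompt = misleading_template.format(question=s["question"])
        misled = query_mllm(s["image"], misled_prompt)
        if not is_correct(misled, s["answer"]):
            flipped += 1
    return flipped / initially_correct if initially_correct else 0.0
```

Under this procedure, examples with high error rates are the ones collected into MUB.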

Takeaways, Limitations

Takeaways:
We systematically investigated MLLMs' vulnerability to misleading cues and the resulting uncertainty in their responses.
A new benchmark (MUB) was proposed for evaluating and improving the reliability of MLLMs.
We show that the error rate of MLLMs can be significantly reduced through fine-tuning.
By analyzing MLLMs' vulnerability to various types of misleading information and suggesting mitigation strategies, the work contributes to improving the safety and reliability of MLLMs in practical applications.
Limitations:
The current benchmark and fine-tuning datasets focus on specific types of misleading cues, which may limit generalizability to other error types.
Even after fine-tuning, the error rate for implicit cues remains relatively high (32.77%).
Since fine-tuning was applied only to open-source models, further research is needed to determine whether the improvements generalize to commercial (closed-source) models.
The relatively small size of the fine-tuning dataset (2,000 samples) is another limitation.