Daily Arxiv

This is a page that curates AI-related papers published worldwide.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Enhancing Visual Reliance in Text Generation: A Bayesian Perspective on Mitigating Hallucination in Large Vision-Language Models

Created by
  • Haebom

Authors

Nanxing Hu, Xiaoyue Duan, Jinchao Zhang, Guoliang Kang

Outline

This paper proposes a novel method for addressing hallucination in large vision-language models (LVLMs). LVLMs generate contextually fluent text but often produce content that is inconsistent with the visual input, which hinders their practical use. Prior work has focused on improving the features or outputs of a single modality (visual or textual) and has not explicitly or systematically strengthened visual reliance. This paper comprehensively investigates, from a Bayesian perspective, the factors that weaken visual reliance during LVLM text generation, and based on this analysis proposes three interventions. First, because not all visual tokens help generate meaningful text, uninformative visual tokens are removed so they cannot interfere. Second, because LVLMs can produce unexpected words by encoding an irrelevant language prior, the prior is corrected from a Bayesian perspective. Third, because the posterior over token predictions conditioned on visual tokens can collapse to a prior that ignores all informative visual tokens, generation is stopped at that point to avoid hallucination. Extensive experiments on three benchmarks, POPE, CHAIR, and MME, show that the proposed method consistently mitigates LVLM hallucination and outperforms existing state-of-the-art techniques.
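As a hedged reconstruction of the Bayesian view described above (the notation here is assumed, not necessarily the paper's): writing v for the visual tokens and x_{<t} for the text generated so far, Bayes' rule factors the next-token posterior into a visual likelihood and a language prior:

```latex
% Bayesian factorization of next-token prediction (notation assumed, not the paper's)
p(y_t \mid v, x_{<t}) \propto
  \underbrace{p(v \mid y_t, x_{<t})}_{\text{visual evidence}}
  \cdot
  \underbrace{p(y_t \mid x_{<t})}_{\text{language prior}}
```

Hallucination corresponds to the degenerate case p(y_t | v, x_{<t}) ≈ p(y_t | x_{<t}), where the visual evidence no longer shifts the prediction; the paper's third intervention stops generation once this collapse is detected.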
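To make the three interventions concrete, below is a minimal, hypothetical Python sketch. The function names, the attention-based pruning criterion, the contrastive-style prior correction, and the KL-based collapse test are all illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def prune_visual_tokens(visual_tokens, attention_scores, keep_ratio=0.5):
    """Intervention 1 (assumed criterion): keep only the visual tokens
    the text decoder attends to most, so uninformative tokens cannot interfere.

    visual_tokens: (N, d) array of visual token embeddings.
    attention_scores: (N,) aggregate attention each visual token receives.
    """
    k = max(1, int(len(visual_tokens) * keep_ratio))
    keep = np.argsort(attention_scores)[-k:]   # indices of the top-k tokens
    return visual_tokens[np.sort(keep)]        # preserve original ordering

def correct_prior(logits_with_image, logits_text_only, alpha=1.0):
    """Intervention 2 (assumed form): down-weight the language-only prior
    with a contrastive-style correction, log p(y|v,x) - alpha * log p(y|x)."""
    return logits_with_image - alpha * logits_text_only

def posterior_collapsed(logits_with_image, logits_text_only, tau=0.05):
    """Intervention 3 (assumed test): flag collapse when the image-conditioned
    posterior is nearly indistinguishable from the text-only prior (small KL)."""
    p = softmax(logits_with_image)
    q = softmax(logits_text_only)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)))
    return kl < tau

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vis = rng.normal(size=(16, 8))    # 16 hypothetical visual tokens
    attn = rng.random(16)             # aggregate attention per token
    kept = prune_visual_tokens(vis, attn, keep_ratio=0.25)
    logits_v = rng.normal(size=32)    # vocabulary logits with the image
    logits_t = rng.normal(size=32)    # vocabulary logits without the image
    corrected = correct_prior(logits_v, logits_t, alpha=0.5)
    print(kept.shape, corrected.shape, posterior_collapsed(logits_v, logits_t))
```

In an actual decoding loop, logits_with_image and logits_text_only would come from two forward passes of the same LVLM, with and without the visual tokens, so the KL test directly measures how much the image is still influencing the prediction.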

Takeaways, Limitations

Takeaways:
  • Systematically analyzes LVLM hallucination from a Bayesian perspective and proposes an effective method for strengthening visual reliance.
  • Alleviates hallucination through a three-pronged approach: removing uninformative visual tokens, correcting the language prior, and stopping generation when the posterior collapses.
  • Demonstrates practical effectiveness by outperforming existing state-of-the-art methods on three benchmarks.
Limitations:
  • The method's effectiveness may be limited to the benchmark datasets evaluated; additional experiments on diverse datasets and LVLM architectures are needed.
  • Bayesian analyses and the resulting interventions can be computationally expensive; efficient implementation strategies are needed for real-time applications.
  • The analysis of hallucination causes is confined to a Bayesian perspective and may need to be complemented by analyses from other perspectives.