Daily Arxiv

This page curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding

Created by
  • Haebom

Author

Shunqi Mao, Chaoyi Zhang, Weidong Cai

Outline

Existing vision-language models (VLMs) suffer from visual hallucination, a phenomenon in which generated responses contain content inconsistent with the visual input. Prior attempts to address this issue without model fine-tuning mitigate hallucination mainly either through contrastive decoding that reduces language bias or by amplifying the weight of visual embeddings during decoding; however, these approaches are limited in their ability to capture fine visual details. This study proposes Perception Magnifier (PM), a novel visual decoding method that iteratively isolates relevant visual tokens via attention and magnifies the corresponding regions, guiding the model to focus on fine visual details during decoding. By magnifying critical regions while preserving structural and contextual information at each decoding step, PM strengthens the VLM's scrutiny of the visual input, enabling it to generate more accurate and faithful responses. Extensive experiments demonstrate that PM not only mitigates hallucination but also improves language generation while maintaining strong reasoning capabilities.
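The core idea of attention-guided magnification can be sketched as a crop-and-zoom over the image region with the highest attention mass. The snippet below is a minimal illustration under stated assumptions, not the paper's actual implementation: the function name, the brute-force window search, the `keep_frac` parameter, and the nearest-neighbor zoom are all simplifications chosen for clarity.

```python
import numpy as np

def magnify_salient_region(image, attn, keep_frac=0.25, zoom=2):
    """Hypothetical sketch of attention-guided magnification.

    image: (H*ph, W*pw) array; attn: (H, W) attention over the patch grid.
    Selects the patch window with the highest summed attention, crops the
    corresponding pixels, and upsamples the crop by `zoom` (nearest-neighbor).
    """
    H, W = attn.shape
    ph, pw = image.shape[0] // H, image.shape[1] // W
    # Window size covering roughly keep_frac of the patch grid.
    wh = max(1, int(round(H * keep_frac ** 0.5)))
    ww = max(1, int(round(W * keep_frac ** 0.5)))
    # Brute-force search for the window with maximal attention mass.
    best, best_score = (0, 0), -1.0
    for i in range(H - wh + 1):
        for j in range(W - ww + 1):
            score = attn[i:i + wh, j:j + ww].sum()
            if score > best_score:
                best_score, best = score, (i, j)
    i, j = best
    crop = image[i * ph:(i + wh) * ph, j * pw:(j + ww) * pw]
    # Nearest-neighbor zoom: repeat pixels along both axes.
    return np.repeat(np.repeat(crop, zoom, axis=0), zoom, axis=1)
```

In the paper's setting this magnified view would re-enter the VLM at each decoding step; here the sketch only shows the selection-and-zoom mechanics.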

Takeaways, Limitations

Takeaways:
We present a novel visual decoding method (PM) that effectively alleviates visual hallucination problems by capturing fine visual details.
Experiments demonstrate superior hallucination mitigation and improved language generation compared to existing methods.
Visual accuracy is increased while strong reasoning capabilities are maintained.
Limitations:
PM's performance gains may be limited to specific datasets or model architectures.
Further research is needed on generalization to more complex and diverse visual environments.
Magnification at each decoding step may increase computational cost.