Daily Arxiv

This page curates AI-related papers published around the world.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens

Created by
  • Haebom

Authors

Sohee Kim, Soohyun Ryu, Joonhyung Park, Eunho Yang

Outline

This paper reveals a failure mode in which large vision-language models (LVLMs) treat text inputs that lack visual evidence as if they were part of the image, leading to errors. By probing the ability of LVLMs to judge whether textual concepts are grounded in the image, we identify visual absence awareness (VA) neurons, a specific subset of feedforward network (FFN) neurons that signal visual absence through a distinctive activation pattern. Leveraging this pattern, we develop a detection module that classifies input tokens as visually grounded or not. Based on its predictions, we propose a method that refines the output by reinterpreting the question prompt or replacing visually absent tokens detected during generation. Extensive experiments demonstrate that the proposed method effectively mitigates the model's tendency to make false presumptions about visual presence and generalizes across a variety of LVLMs.
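To make the mechanism concrete, here is a minimal sketch (in PyTorch) of how a VA-neuron-based detector could be wired up. It is not the paper's implementation: the neuron indices (va_neuron_indices), the decision threshold, and the mean-activation scoring rule are all illustrative assumptions, not values or procedures from the paper.

```python
# Minimal sketch, not the authors' implementation: flagging "visually absent"
# text tokens from one layer's FFN activations. The neuron indices, threshold,
# and tensor shapes are illustrative assumptions, not values from the paper.
import torch


def detect_visually_absent(ffn_activations: torch.Tensor,
                           va_neuron_indices: torch.Tensor,
                           threshold: float) -> torch.Tensor:
    """ffn_activations: (num_tokens, hidden_dim) activations from an FFN layer.
    Returns a boolean mask that is True for tokens judged to lack visual grounding."""
    # Average the activations of the hypothesized VA neurons for each token.
    va_scores = ffn_activations[:, va_neuron_indices].mean(dim=-1)
    # Tokens whose VA score exceeds a calibrated threshold are flagged as
    # visually absent; in practice the threshold would be fit on held-out data.
    return va_scores > threshold


# Toy usage with synthetic activations standing in for a real LVLM forward pass.
torch.manual_seed(0)
activations = torch.randn(5, 4096)          # 5 text tokens, hidden size 4096
va_indices = torch.tensor([12, 305, 2048])  # hypothetical VA neuron indices
mask = detect_visually_absent(activations, va_indices, threshold=0.5)
print(mask)  # boolean mask over the 5 tokens
```

Once tokens are flagged in this way, the refinement step described above would reinterpret the question prompt or replace the flagged tokens during generation.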

Takeaways, Limitations

Takeaways:
Provides new insights into how LVLMs process visual information.
Presents a novel method for detecting text inputs that lack visual evidence and for refining the model's output.
Offers a general methodology applicable to a variety of LVLMs.
Limitations:
Further study is needed to determine whether the activation patterns of VA neurons are consistent across all LVLMs.
Further validation is needed to determine how well the proposed method generalizes to different types of images and text inputs.
Further research is needed on its performance in cases requiring complex visual reasoning.