Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Revisit What You See: Disclose Language Prior in Vision Tokens for LVLM Decoding

Created by
  • Haebom

Author

Beomsik Cho, Jaehyung Kim

ReVisiT: A Training-Free Decoding Method for Enhancing Visual Grounding in Large Vision-Language Models

Outline

Large vision-language models (LVLMs) integrate visual perception with language understanding and demonstrate strong performance across a variety of multimodal tasks. However, the contribution of visual information to the model's decoding process remains understudied, as evidenced by frequent hallucinations. Through a series of analyses, the authors find that (i) vision tokens carry meaningful visual information even when hallucinations occur, and (ii) the semantics of vision tokens are encoded in the text space and can be disambiguated under appropriate lexical constraints. Based on these observations, they propose ReVisiT, a simple, training-free decoding method that references vision tokens to guide text generation. The approach exploits the semantic information embedded in vision tokens by projecting them onto the text token distribution. Specifically, ReVisiT dynamically selects the most relevant vision token at each decoding step via context-aware, constrained divergence minimization, and uses the constrained projection to refine the output distribution so that visual semantics are better incorporated. Across five benchmarks with state-of-the-art LVLMs, ReVisiT consistently improves visual grounding with minimal computational overhead, achieving results competitive with or superior to state-of-the-art decoding baselines while reducing computational cost by up to 2x.
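To make the decoding idea concrete, below is a minimal sketch of a ReVisiT-style decoding step, not the authors' implementation. It assumes access to the LVLM's vision-token hidden states (`vision_hidden`), its language-model head (`lm_head`), and the current next-token logits (`text_logits`); the function name `revisit_style_step`, the top-k candidate constraint, the KL divergence measure, and the `alpha` mixing rule are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a ReVisiT-style decoding step (illustrative, not the authors' code).
import torch
import torch.nn.functional as F

def revisit_style_step(text_logits, vision_hidden, lm_head, top_k=50, alpha=0.5):
    """Refine next-token logits using the most relevant vision token (sketch)."""
    # Project each vision token into the text-token distribution space.
    vision_logits = lm_head(vision_hidden)                     # (num_vision_tokens, vocab)

    # Context-aware constraint (assumed here): compare only over the top-k
    # candidate tokens of the current output distribution.
    topk_ids = text_logits.topk(top_k).indices                 # (top_k,)
    p_text = F.softmax(text_logits[topk_ids], dim=-1)          # (top_k,)
    p_vision = F.softmax(vision_logits[:, topk_ids], dim=-1)   # (num_vision_tokens, top_k)

    # Select the vision token whose constrained distribution diverges least
    # from the current output distribution (KL divergence used as an example).
    kl = (p_text * (p_text.log() - p_vision.log())).sum(dim=-1)
    best = kl.argmin()

    # Refine the output distribution with the selected vision token's projection.
    refined = text_logits.clone()
    refined[topk_ids] = (1 - alpha) * text_logits[topk_ids] + alpha * vision_logits[best, topk_ids]
    return refined
```

Here the top-k restriction stands in for the paper's context-aware constraint, and the linear mixing stands in for its refinement of the output distribution; the actual selection rule and refinement in ReVisiT may differ, so treat this as an illustration of the idea rather than a reproduction.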

Takeaways, Limitations

Takeaways:
  • Presents a training-free decoding method that improves the visual grounding of LVLMs.
  • Demonstrates that vision tokens retain useful visual information even when hallucinations occur.
  • Achieves competitive or superior performance while reducing computational cost compared to existing decoding methods.
Limitations:
  • Evaluation is restricted to a limited set of benchmarks.
  • Further research is needed to establish the generalizability of ReVisiT.
  • Possible dependence on specific LVLM architectures.