Large vision-language models (LVLMs) integrate visual perception and language understanding, demonstrating strong performance across a variety of multimodal tasks. However, how visual information contributes to the model's decoding process remains understudied, and the frequent occurrence of hallucinations suggests that it is often underutilized. Through a series of analyses, we find that (i) visual tokens carry meaningful visual information even when hallucinations occur, and (ii) the meaning of visual tokens is encoded in the text space and can be disambiguated under appropriate lexical constraints. Based on these observations, we propose ReVisiT, a simple, training-free decoding method that references visual tokens to guide text generation. Our approach leverages the semantic information embedded in visual tokens by projecting them onto the text token distribution. Specifically, ReVisiT dynamically selects the most relevant visual token at each decoding step through context-aware constrained divergence minimization and uses its constrained projection to refine the output distribution, thereby better integrating visual semantics. Across five benchmarks and state-of-the-art LVLMs, ReVisiT consistently improves visual grounding with minimal computational overhead, achieving results competitive with or superior to state-of-the-art decoding baselines while reducing computational costs by up to 2×.
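To make the decoding procedure concrete, the following is a minimal PyTorch sketch of the idea described above, not the authors' implementation: visual-token hidden states are projected through the language-model head into the vocabulary space, the visual token whose constrained distribution diverges least from the model's current output distribution is selected, and its projection is mixed back into the logits. The function name `revisit_step`, the top-k candidate constraint, the KL-based selection, and the mixing weight `alpha` are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def revisit_step(logits, visual_hidden, lm_head, top_k=10, alpha=0.5):
    """One decoding step of the sketched idea (not the official ReVisiT code).

    logits:        (vocab,) model logits for the next text token
    visual_hidden: (num_visual_tokens, hidden) hidden states of the visual tokens
    lm_head:       nn.Linear mapping hidden states to vocabulary logits
    """
    # Project every visual token into the text-token (vocabulary) space.
    visual_logits = lm_head(visual_hidden)                    # (V_img, vocab)

    # Context-aware constraint: restrict both distributions to the model's
    # current top-k candidate tokens (an assumed choice of constraint set).
    cand = torch.topk(logits, top_k).indices                  # (top_k,)
    p_text = F.softmax(logits[cand], dim=-1)                  # (top_k,)
    p_vis = F.softmax(visual_logits[:, cand], dim=-1)         # (V_img, top_k)

    # Select the visual token whose constrained distribution diverges least
    # from the text distribution (KL divergence used here as an assumption).
    kl = (p_vis * (p_vis.clamp_min(1e-9).log()
                   - p_text.clamp_min(1e-9).log())).sum(dim=-1)
    best = kl.argmin()

    # Refine the output distribution by mixing in the selected visual token's
    # projection on the candidate set (alpha is a hypothetical weight).
    refined = logits.clone()
    refined[cand] = logits[cand] + alpha * visual_logits[best, cand]
    return refined
```

Because the refinement only touches the top-k candidate tokens and reuses hidden states already computed by the LVLM, a step like this adds little overhead on top of standard greedy or sampling-based decoding.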