To address the challenges of long video understanding with Vision Large Language Models (VLLMs), this paper presents Balanced-VLLM (B-VLLM), a framework built around a text-conditioned adaptive frame selection module, a temporal frame token merging technique, and a spatial token sampling and merging strategy. Existing VLLMs either downsample the frames of a video, which discards temporal cues, or reduce the number of visual tokens per frame, which discards spatial detail; B-VLLM instead exploits task-relevant spatio-temporal cues while keeping the number of visual tokens within the VLLM's context window length. Experimental results show that B-VLLM achieves superior performance on various video understanding benchmarks.
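
The token-budgeting idea can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `select_and_merge_tokens`, the cosine-similarity scoring, and the specific budgets are illustrative assumptions, used only to show how text-conditioned frame selection plus per-frame token sampling and merging can cap the visual-token count fed to the language model.

```python
import torch
import torch.nn.functional as F

def select_and_merge_tokens(frame_tokens, text_embed,
                            max_frames=16, max_tokens_per_frame=32):
    """Illustrative sketch only; names, scoring, and budgets are assumptions.

    frame_tokens: (T, N, D) visual tokens for T frames, N tokens per frame.
    text_embed:   (D,) pooled embedding of the text query.
    Returns at most max_frames * max_tokens_per_frame visual tokens.
    """
    T, N, D = frame_tokens.shape

    # 1) Text-conditioned frame selection: score each frame by similarity
    #    between its mean-pooled tokens and the text query, keep the top-k
    #    frames in their original temporal order.
    frame_feat = frame_tokens.mean(dim=1)                                # (T, D)
    scores = F.cosine_similarity(frame_feat, text_embed[None], dim=-1)   # (T,)
    k = min(max_frames, T)
    top_idx = scores.topk(k).indices.sort().values
    selected = frame_tokens[top_idx]                                     # (k, N, D)

    # 2) Spatial token sampling/merging: within each selected frame, keep the
    #    tokens most relevant to the query and merge the remainder into a
    #    single averaged token so spatial context is not discarded entirely.
    out = []
    for tokens in selected:                                              # (N, D)
        tok_scores = F.cosine_similarity(tokens, text_embed[None], dim=-1)
        m = min(max_tokens_per_frame - 1, N)
        keep_idx = tok_scores.topk(m).indices
        rest = torch.ones(N, dtype=torch.bool)
        rest[keep_idx] = False
        merged = tokens[rest].mean(dim=0, keepdim=True) if rest.any() else tokens[:0]
        out.append(torch.cat([tokens[keep_idx], merged], dim=0))
    return torch.cat(out, dim=0)

# Usage: 64 frames of 196 patch tokens are reduced to at most 16 * 32 = 512 tokens.
frame_tokens = torch.randn(64, 196, 768)
text_embed = torch.randn(768)
print(select_and_merge_tokens(frame_tokens, text_embed).shape)
```

Under these assumptions, the budget is fixed regardless of video length: long videos are handled by dropping text-irrelevant frames, while short videos retain more spatial tokens per frame, which is the balance the framework's name refers to.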