In this paper, we propose a training-free spatiotemporal token merging (STTM) method to reduce the computational cost of video large language models (LLMs). Exploiting the local spatial and temporal redundancy of video data, which prior work has overlooked, STTM performs multi-resolution spatial token merging over a quadtree structure, followed by directed pairwise merging along the temporal dimension. This decomposed merging approach outperforms existing token reduction methods on six video question-answering benchmarks, achieving a 2× speedup with only a 0.5% accuracy drop under a 50% token budget, and a 3× speedup with only a 2% drop under a 30% token budget. Moreover, because STTM is query-agnostic, it allows the KV cache to be reused across different questions about the same video.
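To make the quadtree idea concrete, the following is a minimal, hypothetical sketch of multi-resolution spatial merging: a square grid of visual tokens is collapsed to a single averaged token wherever a block is internally similar, and otherwise split into four quadrants and processed recursively. The function name, the cosine-similarity criterion, and the threshold are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of quadtree-based spatial token merging (not the
# paper's exact algorithm): homogeneous blocks collapse to one token,
# heterogeneous blocks split into four quadrants and recurse.
import numpy as np

def quadtree_merge(tokens, threshold=0.9):
    """Merge an (H, W, D) grid of tokens; H = W = power of two assumed.

    Returns a list of merged token vectors (each of shape (D,)).
    """
    h, w, _ = tokens.shape
    flat = tokens.reshape(h * w, -1)
    mean = flat.mean(axis=0)
    # Cosine similarity of every token in the block to the block mean.
    sims = (flat @ mean) / (
        np.linalg.norm(flat, axis=1) * np.linalg.norm(mean) + 1e-8
    )
    if h == 1 or sims.min() >= threshold:
        return [mean]  # block is homogeneous: keep one averaged token
    m = h // 2
    out = []
    # Recurse into the four quadrants (top-left, top-right, bottom-left, bottom-right).
    for block in (tokens[:m, :m], tokens[:m, m:], tokens[m:, :m], tokens[m:, m:]):
        out.extend(quadtree_merge(block, threshold))
    return out

# Usage: an 8x8 grid of 16-dim tokens yields between 1 and 64 merged tokens.
grid = np.random.rand(8, 8, 16).astype(np.float32)
merged = quadtree_merge(grid)
```

A real implementation would operate on the vision encoder's token embeddings and pair this spatial pass with the directed pairwise temporal merge described above.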