This paper proposes SpecVLM, a training-free speculative decoding (SD) framework for efficient decoding of Video Large Language Models (Vid-LLMs). While Vid-LLMs deliver strong performance on video content understanding, their dense video token representations incur substantial memory and computational overhead. SpecVLM accelerates decoding through stepwise video token pruning while minimizing information loss. We find that the draft model's speculation is insensitive to video token pruning: accuracy is maintained even when up to 90% of video tokens are pruned. Pruning proceeds in two stages: the first selects information-rich tokens guided by the target model's attention signals, and the second prunes the remaining redundant tokens via spatially uniform sampling. Experiments demonstrate decoding speedups of up to 2.68x on LLaVA-OneVision-72B and up to 2.11x on Qwen2.5-VL-32B.
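The two-stage pruning described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `two_stage_prune`, the keep budget, and the split between attention-selected and uniformly sampled tokens (`topk_frac`) are all hypothetical parameters, and real attention scores would come from the target model's attention maps rather than a random array.

```python
import numpy as np

def two_stage_prune(attn_scores, keep_ratio=0.10, topk_frac=0.5):
    """Sketch of two-stage video-token pruning.

    attn_scores: (N,) attention mass each video token receives from the target model.
    keep_ratio:  fraction of tokens to keep overall (the paper prunes up to 90%).
    topk_frac:   hypothetical split - fraction of the kept budget chosen by
                 attention (stage 1); the rest is filled by uniform spatial
                 sampling (stage 2).
    """
    n = len(attn_scores)
    budget = max(1, int(n * keep_ratio))
    k1 = max(1, int(budget * topk_frac))

    # Stage 1: keep the most information-rich tokens by attention score.
    stage1 = np.argsort(attn_scores)[::-1][:k1]

    # Stage 2: fill the remaining budget by uniform sampling over the
    # token positions that stage 1 did not select (spatial redundancy).
    rest = np.setdiff1d(np.arange(n), stage1)
    k2 = budget - k1
    if k2 > 0 and len(rest) > 0:
        idx = np.linspace(0, len(rest) - 1, num=min(k2, len(rest))).astype(int)
        stage2 = rest[idx]
    else:
        stage2 = np.array([], dtype=int)

    # Return kept token indices in their original (spatial/temporal) order.
    return np.sort(np.concatenate([stage1, stage2]))

kept = two_stage_prune(np.random.rand(1000))
print(len(kept))  # 100 of 1000 tokens kept, i.e. 90% pruned
```

The draft model would then speculate over only these kept tokens, while the target model verifies with the full token set, so exactness of the output is preserved by standard speculative decoding.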