To address the high computational cost and inference latency of Vision-Language Models (VLMs), this paper proposes Fourier-VLM, a novel method for compressing visual representations in the frequency domain. Existing VLMs replace image placeholder tokens with visual features extracted from the image encoder, but the large number of visual tokens substantially lengthens the context and drives up computational cost. Fourier-VLM exploits the observation that the energy of visual features is concentrated in low-frequency components and applies a low-pass filter to the visual representations via a two-dimensional discrete cosine transform (DCT). The DCT is computed efficiently with the fast Fourier transform (FFT), keeping the computational overhead minimal without introducing additional parameters. Experiments on various image-based benchmarks demonstrate that Fourier-VLM achieves competitive performance and strong generalization across both the LLaVA and Qwen-VL architectures. Compared with LLaVA-v1.5, the proposed approach reduces inference FLOPs by up to 83.8% and improves generation speed by 31.2%.
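To illustrate the idea of frequency-domain token compression, the following is a minimal sketch, not the paper's exact pipeline: it assumes visual features are arranged on their 2D spatial grid, applies a 2D DCT (computed via FFT routines in SciPy), keeps only a low-frequency block of coefficients, and inverts back to a smaller grid of tokens. The function name `compress_visual_tokens`, the `keep` parameter, and the grid sizes are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dctn, idctn  # DCT/IDCT computed via FFT internally


def compress_visual_tokens(features, keep=8):
    """Low-pass compress a grid of visual features in the frequency domain.

    features: (H, W, D) array of visual tokens on their 2D spatial grid.
    keep:     side length of the retained low-frequency block (assumption).
    Returns a (keep, keep, D) array of compressed visual tokens.
    """
    # 2D DCT over the spatial axes only; 'ortho' normalization keeps the
    # transform orthonormal so energy is preserved.
    coeffs = dctn(features, type=2, axes=(0, 1), norm="ortho")
    # Low-pass filter: retain only the top-left (low-frequency) block.
    low = coeffs[:keep, :keep, :]
    # Inverse DCT on the truncated block yields a smaller spatial grid,
    # i.e. fewer visual tokens to feed into the language model.
    return idctn(low, type=2, axes=(0, 1), norm="ortho")


# Example: a 24x24 grid of 1024-dim visual features -> 8x8 grid (64 tokens).
tokens = np.random.randn(24, 24, 1024).astype(np.float32)
compressed = compress_visual_tokens(tokens, keep=8)
print(compressed.shape)  # (8, 8, 1024)
```

Because the transform is parameter-free and FFT-based, the compression step adds negligible compute compared with the attention cost saved by shortening the visual context.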