Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please credit the source when sharing.

Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models

Created by
  • Haebom

Author

Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin

Outline

To address the high computational cost and inference latency of Vision-Language Models (VLMs), this paper proposes Fourier-VLM, a method for compressing visual representations in the frequency domain. Existing VLMs replace image placeholder tokens with visual features extracted from the image encoder, but the large number of visual tokens inflates the context length and drives up computational cost. Fourier-VLM exploits the fact that the energy of visual features is concentrated in low-frequency components and applies a low-pass filter via a two-dimensional discrete cosine transform (DCT) to compress the visual representation. The DCT is computed efficiently with the fast Fourier transform (FFT), keeping the computational overhead minimal and requiring no additional parameters. Experiments on various image-based benchmarks show that Fourier-VLM achieves competitive performance with strong generalization across both the LLaVA and Qwen-VL architectures. Compared to LLaVA-v1.5, the proposed approach reduces inference FLOPs by up to 83.8% and improves generation speed by 31.2%.
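To make the core idea concrete, below is a minimal sketch of frequency-domain token compression: a 2D DCT over the spatial grid of vision tokens followed by a low-pass crop of the coefficient block. The function name, shapes, and the `keep` parameter are illustrative assumptions, not the paper's actual implementation or hyperparameters.

```python
import numpy as np
from scipy.fft import dctn  # type-II DCT, computed via FFT internally

def compress_vision_tokens(features: np.ndarray, keep: int = 8) -> np.ndarray:
    """Hypothetical sketch of DCT-based low-pass compression of vision tokens.

    features: (H, W, D) spatial grid of D-dimensional visual features
    keep:     number of low-frequency rows/columns retained per spatial axis
    returns:  (keep, keep, D) compressed token grid
    """
    # 2D DCT (orthonormal) over the spatial axes only; the feature dimension is untouched.
    freq = dctn(features, type=2, norm="ortho", axes=(0, 1))
    # Low-pass filter: keep only the top-left (low-frequency) block of coefficients.
    return freq[:keep, :keep, :]

# Example: a 24x24 grid of 1024-dim features compressed to 8x8 (576 -> 64 tokens).
feats = np.random.randn(24, 24, 1024).astype(np.float32)
compressed = compress_vision_tokens(feats, keep=8)
print(compressed.shape)  # (8, 8, 1024)
```

Because only a fixed slice of DCT coefficients is kept, the compression adds no learnable parameters, which matches the parameter-free claim in the outline above.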

Takeaways, Limitations

Takeaways:
  • Demonstrates that frequency-domain compression can effectively reduce the computational cost and inference latency of VLMs.
  • Achieves efficiency gains without introducing any additional parameters.
  • Shows strong generalization across architectures such as LLaVA and Qwen-VL.
  • Significantly improves the efficiency and practicality of VLMs in real-world applications.
Limitations:
  • The reported gains may be biased toward specific datasets or architectures; more extensive experiments are needed to verify generalization.
  • The method assumes that feature energy is concentrated in low-frequency components; further research is needed to determine whether this assumption holds for all image data.
  • DCT-based low-pass compression inherently discards high-frequency components, so information loss is possible; further work may be needed to minimize the resulting performance degradation.