This paper highlights that, despite advances in multimodal Large Language Models (LLMs) that have significantly improved the ability to analyze and understand complex inputs across multiple modalities, long document processing remains underexplored, largely due to a lack of suitable benchmarks. To address this gap, the paper introduces Document Haystack, a comprehensive benchmark designed to evaluate the performance of Vision Language Models (VLMs) on long, visually complex documents. Document Haystack features documents ranging from 5 to 200 pages and strategically inserts pure-text or multimodal text-and-image "needles" at various depths within each document to challenge the retrieval capabilities of VLMs. It comprises 400 document variants and a total of 8,250 questions, and is supported by an objective, automated evaluation framework. The paper details the construction and characteristics of the Document Haystack dataset, presents results from prominent VLMs, and discusses potential research directions in this area.
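
To make the needle-insertion setup concrete, the sketch below shows one way a pure-text needle could be placed at a chosen relative depth in a paged document. This is a minimal illustration, not the authors' implementation: the function name, the depth convention (0.0 = start of document, 1.0 = end), and the example needle phrasing are assumptions.

```python
# Minimal sketch (illustrative, not the benchmark's actual code): insert a
# text "needle" into a document represented as a list of page texts, at a
# given relative depth within the document.

from typing import List


def insert_text_needle(pages: List[str], needle: str, depth: float) -> List[str]:
    """Return a copy of `pages` with `needle` appended to the page located at
    the given relative depth (0.0 = first page, 1.0 = last page)."""
    if not pages:
        raise ValueError("document must contain at least one page")
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")

    # Map the relative depth to a concrete page index.
    page_index = min(int(depth * len(pages)), len(pages) - 1)

    modified = list(pages)
    modified[page_index] = modified[page_index] + "\n" + needle
    return modified


if __name__ == "__main__":
    doc = [f"Page {i} content." for i in range(1, 11)]   # a 10-page document
    secret = 'The secret "fruit" is "apple".'             # hypothetical needle phrasing
    with_needle = insert_text_needle(doc, secret, depth=0.5)
    print(with_needle[5])  # the page at 50% depth now contains the needle
```

A retrieval question paired with this needle (e.g., "What is the secret fruit?") can then be scored automatically by string matching against the known answer, which is what enables the objective, automated evaluation the abstract describes.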