This paper highlights the need for rigorous and comprehensive evaluation methods and benchmarks in the rapidly growing field of Vision-Language Models (VLMs). We analyze existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across a variety of tasks, and introduce Robin, a novel suite of VLMs built by combining Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales. Leveraging Robin, we identify limitations of existing evaluation methods at scale and, to overcome them, propose CHIRP, a long-response benchmark for more robust and complete VLM evaluation. We provide open access to Robin's training code, model suite, and the CHIRP benchmark to enhance reproducibility and advance VLM research.