Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

CHIRP: A Fine-Grained Benchmark for Open-Ended Response Evaluation in Vision-Language Models

Created by
  • Haebom

Authors

Alexis Roger, Prateek Humane, Daniel Z. Kaplan, Kshitij Gupta, Qi Sun, George Adamopoulos, Jonathan Siu Chi Lim, Quentin Anthony, Edwin Fennell, Irina Rish

Outline

This paper highlights the need for rigorous, comprehensive evaluation methods and benchmarks in the rapidly growing field of Vision-Language Models (VLMs). The authors analyze existing VLM evaluation techniques (automated metrics, AI-based assessments, and human assessments across a variety of tasks) and introduce Robin, a novel VLM suite built by combining LLMs and vision encoders (VEs) at various scales. Leveraging Robin, they identify shortcomings of existing evaluation methods across model scales and propose CHIRP, a new long-form response benchmark designed for more robust and complete VLM evaluation. Robin's training code, the model suite, and the CHIRP benchmark are released openly to improve reproducibility and advance VLM research.
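For readers unfamiliar with this architecture pattern, below is a minimal, hypothetical sketch of how a VLM pairs a vision encoder with an LLM: vision features are projected into the LLM's embedding space and prepended to the text tokens. The class names, dimensions, and projector design here are illustrative assumptions, not the authors' actual Robin implementation.

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Hypothetical VE + LLM pairing: vision features are projected into
    the LLM's embedding space and prepended to the text tokens."""

    def __init__(self, vision_dim=256, llm_dim=512, vocab_size=1000):
        super().__init__()
        # Stand-ins for a pretrained vision encoder (VE) and LLM backbone.
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps VE features to LLM space
        self.embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_feats, input_ids):
        # image_feats: (batch, n_patches, vision_dim); input_ids: (batch, seq_len)
        vis = self.projector(self.vision_encoder(image_feats))
        txt = self.embed(input_ids)
        hidden = self.llm(torch.cat([vis, txt], dim=1))  # image tokens come first
        return self.lm_head(hidden)

model = ToyVLM()
logits = model(torch.randn(1, 16, 256), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 1000])
```

Because the vision encoder and the LLM are separate components, each can be swapped or scaled independently, which is what allows a suite like Robin to probe how evaluation behavior changes with model size.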

Takeaways and Limitations

Takeaways:
Analyzes the limitations of existing VLM evaluation methods across model scales and proposes CHIRP, a new benchmark, to overcome them, advancing VLM research.
Offers Robin, a new VLM suite combining LLMs and vision encoders of various scales, improving the reproducibility of VLM studies.
Contributes to the VLM research community through the open release of the CHIRP benchmark and the Robin models and code.
Limitations:
Further review of the scale and diversity of the CHIRP benchmark may be necessary.
Further analysis may be needed to determine how well the Robin model performs compared to other VLMs.
The paper may lack detailed descriptions and reliability analyses of the human assessments.