Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators

Created by
  • Haebom

Authors

Jongwoo Ko, Sungnyun Kim, Sungwoo Cho, Se-Young Yun

Outline

This paper addresses human-generated reward signals, which play a crucial role in aligning generative models with human preferences. LLM-as-a-Judge approaches, which use LLMs as evaluators, greatly reduce the cost of manual annotation, but they typically require extensive modality-specific training data and generalize poorly across diverse multimodal tasks. The paper proposes Flex-Judge, a reasoning-guided multimodal judge model that uses minimal textual reasoning data to generalize robustly across modalities and evaluation formats. The core idea is that structured textual reasoning explanations inherently encode generalizable decision patterns, enabling effective transfer to multimodal judgments over inputs such as images and videos. Experiments show that Flex-Judge achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators, despite being trained on far less text data. This is particularly relevant for modalities such as molecules, where comprehensive evaluation benchmarks are scarce, underscoring the method's practical value in resource-constrained domains. The study advances scalable multimodal model-as-a-judge research by presenting reasoning-based text supervision as a powerful, cost-effective alternative to annotation-intensive approaches.
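For readers unfamiliar with the LLM-as-a-Judge setup referenced above, the sketch below shows the generic pointwise scoring loop: the judge writes a step-by-step reasoning trace and a final verdict, which is then parsed into a numeric score. The prompt template, `generate_fn` callable, and `parse_score` helper are illustrative placeholders, not Flex-Judge's actual implementation; per the summary, Flex-Judge applies this kind of pattern with a judge trained only on textual reasoning data and then used on image, video, or molecule evaluation tasks.

```python
from typing import Callable

# Hypothetical pointwise judging prompt; the exact wording is an assumption.
JUDGE_PROMPT = """You are an impartial judge.
Task description: {task}
Candidate response: {response}
Reason step by step about correctness, relevance, and completeness,
then end with a final line of the form "Score: <integer from 1 to 10>"."""


def parse_score(text: str) -> int:
    """Pull the integer after the last 'Score:' marker; clamp to [1, 10]."""
    for line in reversed(text.strip().splitlines()):
        if "Score:" in line:
            digits = "".join(ch for ch in line.split("Score:")[-1] if ch.isdigit())
            if digits:
                return max(1, min(10, int(digits)))
    return 1  # fallback when the judge ignores the output format


def judge_response(task: str, response: str,
                   generate_fn: Callable[[str], str]) -> int:
    """Score one candidate response with a reasoning-then-verdict judge.

    `generate_fn` is any text-in/text-out call to the judge model
    (a local checkpoint or a hosted API); its reasoning trace is
    parsed into the final score.
    """
    prompt = JUDGE_PROMPT.format(task=task, response=response)
    reasoning_trace = generate_fn(prompt)
    return parse_score(reasoning_trace)


if __name__ == "__main__":
    # Stand-in judge for demonstration only; swap in a real model call.
    def dummy_judge(prompt: str) -> str:
        return "The caption names the main object and setting.\nScore: 8"

    print(judge_response(
        task="Rate how well the caption describes the pictured scene.",
        response="A brown dog catching a frisbee in a park.",
        generate_fn=dummy_judge,
    ))
```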

Takeaways, Limitations

Takeaways:
Presents Flex-Judge, a multimodal judge model that generalizes across diverse modalities using minimal text reasoning data.
Achieves competitive or superior performance compared to commercial APIs and extensively trained multimodal evaluators.
Shows high practical utility in resource-constrained domains (e.g., the molecular modality).
Demonstrates the effectiveness of reasoning-based text supervision, contributing to scalable multimodal model-as-a-judge research.
Limitations:
The generalization performance of the proposed model requires further verification.
Further research is needed on how far generalization extends across modalities and evaluation formats.
Potential data bias toward specific modalities.
Dependence on the quality and quantity of the reasoning-based text data.