Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators

Created by
  • Haebom

Author

Jongwoo Ko, Sungnyun Kim, Sungwoo Cho, Se-Young Yun

Outline

This paper addresses human-generated reward signals, which play a crucial role in aligning generative models with human preferences. Existing LLM-as-a-Judge approaches, which use LLMs as evaluators, greatly reduce the cost of manual annotation, but they typically require extensive modality-specific training data and generalize poorly across diverse multimodal tasks. The paper proposes Flex-Judge, a reasoning-guided multimodal judge model that uses minimal textual reasoning data to generalize robustly across multiple modalities and evaluation formats. The core idea is that structured textual reasoning explanations inherently encode generalizable decision patterns, which transfer effectively to multimodal judgments over inputs such as images and videos.

Experimental results show that Flex-Judge, despite being trained on significantly less text data, achieves performance competitive with or superior to state-of-the-art commercial APIs and extensively trained multimodal evaluators. This finding has broad implications for modalities where comprehensive evaluation benchmarks are scarce, such as molecules, underscoring the framework's practical value in resource-constrained domains. By presenting reasoning-based text supervision as a powerful and cost-effective alternative to annotation-intensive approaches, the framework significantly advances scalable multimodal model-as-a-judge research.
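To make the evaluation flow concrete, here is a minimal sketch of an LLM-as-a-Judge loop in the spirit of Flex-Judge: a judge trained on textual reasoning data receives a task and a candidate response (e.g., a caption produced for an image), writes out its step-by-step reasoning, and ends with a numeric verdict. The checkpoint name `my-org/flex-judge-sketch` and the prompt template are illustrative assumptions, not artifacts released with the paper.

```python
# A minimal sketch of the judge-style evaluation loop described above,
# assuming a hypothetical text-reasoning-trained checkpoint; the model
# name and prompt format are illustrative, not the paper's released code.
import re

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "my-org/flex-judge-sketch"  # hypothetical checkpoint

JUDGE_TEMPLATE = """You are an impartial judge. A model was given the task
below and produced the response shown. Reason step by step about the
response's quality, then end with "Verdict: <score 1-10>".

[Task]
{task}

[Model response]
{response}
"""


def judge(task: str, response: str) -> tuple[str, int | None]:
    """Generate a reasoning trace and extract the final numeric verdict."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

    prompt = JUDGE_TEMPLATE.format(task=task, response=response)
    inputs = tokenizer(prompt, return_tensors="pt")

    # The judge writes out its structured reasoning before the verdict;
    # that explanation is what the paper argues transfers across modalities.
    output_ids = model.generate(**inputs, max_new_tokens=512)
    text = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )

    match = re.search(r"Verdict:\s*(\d+)", text)
    return text, int(match.group(1)) if match else None
```

In the paper's setting, the judge itself is multimodal and only its reasoning supervision is text-only, so an image or video would be passed to the model directly rather than serialized into the prompt as assumed here.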

Takeaways, Limitations

Takeaways:
  • Presents a multimodal judge model that generalizes well across modalities using only minimal text data.
  • Offers a more efficient and cost-effective way to evaluate multimodal models than conventional annotation-intensive approaches.
  • Shows the approach can be used effectively even in resource-poor domains (e.g., the molecular modality).
  • Demonstrates the utility of reasoning-based text supervision.
Limitations:
  • The model's performance may be biased toward specific datasets or tasks (the paper does not explicitly enumerate its limitations).
  • Further research may be needed on the transparency and interpretability of the reasoning process (the paper offers only a limited description of it).