This paper addresses human-generated reward signals, which play a crucial role in aligning generative models with human preferences. Existing approaches that use LLMs as evaluators (LLM-as-a-Judge) substantially reduce the cost of manual annotation, but they typically require extensive modality-specific training data and generalize poorly across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that generalizes robustly across multiple modalities and evaluation formats while using only minimal textual reasoning data. The core idea is that structured textual reasoning explanations inherently encode generalizable decision patterns, which transfer effectively to multimodal judgments, e.g., over images and videos. Experimental results demonstrate that Flex-Judge achieves performance competitive with or superior to state-of-the-art commercial APIs and extensively trained multimodal evaluators, despite being trained on significantly less text data. This finding has broad implications, particularly for modalities such as molecules, where comprehensive evaluation benchmarks are lacking, highlighting its practical value in resource-constrained domains. The proposed framework significantly advances scalable multimodal model-as-a-judge research by establishing reasoning-based text supervision as a powerful and cost-effective alternative to existing annotation-intensive approaches.
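
To make the model-as-a-judge setup concrete, the minimal sketch below illustrates one way a single reasoning-trained text judge could be reused across modalities: only the description of the input changes, while the structured reasoning-then-verdict format stays fixed. The prompt template, field names, and 1-10 scoring scale are illustrative assumptions for exposition, not the actual prompt or protocol used by Flex-Judge.

```python
import re

# Hypothetical judge prompt template (an assumption, not Flex-Judge's actual prompt).
# The same structured reasoning-then-verdict format is shared across modalities;
# only the input description (image caption, video transcript, molecule string, ...)
# changes, which is the kind of transfer the abstract describes.
JUDGE_TEMPLATE = """You are an impartial judge.
[Task] {task}
[Input ({modality})] {input_description}
[Candidate Response] {response}
First reason step by step about correctness, relevance, and clarity.
Then output a final line of the form "Verdict: <score 1-10>"."""


def build_judge_prompt(task: str, modality: str, input_description: str, response: str) -> str:
    """Fill the shared template; the same structure serves text, image, video, or molecule inputs."""
    return JUDGE_TEMPLATE.format(
        task=task,
        modality=modality,
        input_description=input_description,
        response=response,
    )


def parse_verdict(judge_output: str) -> int | None:
    """Extract the numeric score from the judge's structured output, if present."""
    match = re.search(r"Verdict:\s*(\d+)", judge_output)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    prompt = build_judge_prompt(
        task="Describe the image accurately.",
        modality="image",
        input_description="A photo of two dogs playing fetch in a park.",
        response="Two dogs are running after a ball on a grassy field.",
    )
    print(prompt)
    # A reasoning-trained judge model would be queried with this prompt;
    # here we only demonstrate parsing a structured verdict.
    print(parse_verdict("The response matches the scene well.\nVerdict: 9"))
```

In this sketch, the evaluation logic lives entirely in the judge's textual reasoning, which is why supervision from text-only reasoning traces could plausibly carry over to new modalities without modality-specific judge training.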