Human-generated reward signals play a crucial role in aligning generative models with human preferences. LLM-as-a-Judge approaches, which use LLMs as evaluators, substantially reduce the cost of manual annotation, but they typically require extensive modality-specific training data and fail to generalize across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that generalizes robustly across multiple modalities and evaluation formats using minimal textual reasoning data. The core idea is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling effective transfer to multimodal judgments, e.g., over images and videos. Experimental results demonstrate that Flex-Judge achieves competitive or superior performance compared with state-of-the-art commercial APIs and extensively trained multimodal evaluators, despite being trained on significantly less text data. This finding is particularly relevant for modalities such as molecules, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. This work advances scalable multimodal model-as-a-judge frameworks by establishing reasoning-guided text supervision as a powerful and cost-effective alternative to existing annotation-intensive approaches.