To address the challenges of open-ended output evaluation of large-scale multimodal models, this paper proposes UFEval, a fine-grained evaluator that integrates multiple tasks and aspects. UFEval is based on a hierarchical aspect taxonomy encompassing 112 fine-grained aspects across four tasks: natural language generation, image understanding, image generation, and cross-text and image generation. We trained UFEval on FRABench, a fine-grained evaluation dataset consisting of 64,000 pairwise comparison samples and 325,000 evaluation labels. Experimental results demonstrate that learning on specific aspects enables generalization to unseen aspects, and that joint learning across multiple tasks and aspects yields mutually beneficial outcomes.