Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

UFEval: Unified Fine-grained Evaluation with Task and Aspect Generalization

Created by
  • Haebom

Authors

Shibo Hong, Jiahao Ying, Haiyuan Liang, Mengdi Zhang, Jun Kuang, Jiazheng Zhang, Yixin Cao

Outline

To address the challenges of evaluating the open-ended outputs of large multimodal models, this paper proposes UFEval, a unified fine-grained evaluator that covers multiple tasks and aspects. UFEval is built on a hierarchical aspect taxonomy encompassing 112 fine-grained aspects across four tasks: natural language generation, image understanding, image generation, and interleaved text-and-image generation. The authors train UFEval on FRABench, a fine-grained evaluation dataset consisting of 64,000 pairwise comparison samples and 325,000 evaluation labels. Experimental results indicate that training on specific aspects enables generalization to unseen aspects, and that joint training across multiple tasks and aspects is mutually beneficial.
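To make the idea of aspect-level pairwise evaluation concrete, here is a minimal Python sketch of what one such sample and an agreement metric might look like. The class `PairwiseSample`, its field names, the aspect names, and the function `aspect_agreement` are all hypothetical illustrations; they are not taken from FRABench or from the UFEval code.

```python
from dataclasses import dataclass, field


@dataclass
class PairwiseSample:
    """Hypothetical shape of one pairwise comparison sample."""
    task: str
    prompt: str
    response_a: str
    response_b: str
    # Hypothetical label format: aspect name -> preferred response ("A", "B", or "tie").
    labels: dict = field(default_factory=dict)


def aspect_agreement(samples, predict):
    """Fraction of (sample, aspect) labels that the evaluator's prediction matches."""
    total = 0
    matched = 0
    for sample in samples:
        for aspect, gold in sample.labels.items():
            total += 1
            if predict(sample, aspect) == gold:
                matched += 1
    return matched / total if total else 0.0


# Toy data with made-up aspect names, for illustration only.
samples = [
    PairwiseSample(
        task="natural language generation",
        prompt="Summarize the article.",
        response_a="A short, fluent summary.",
        response_b="A long summary that drifts off topic.",
        labels={"fluency": "A", "relevance": "B"},
    )
]


def always_a(sample, aspect):
    """Trivial baseline evaluator that always prefers response A."""
    return "A"


score = aspect_agreement(samples, always_a)
print(score)
```

Because each sample carries several aspect labels, a single pair can contribute multiple evaluation labels, which is consistent with the label count exceeding the sample count in the summary above.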

Takeaways, Limitations

Takeaways:
Presents a unified, fine-grained evaluation approach for multimodal models that covers a variety of tasks and modalities.
Shows that training on specific aspects can generalize to unseen aspects.
Identifies synergistic effects of joint training across multiple tasks and aspects.
Provides FRABench, a large-scale multimodal dataset with aspect-level evaluation labels.
Limitations:
Further review of the reliability and bias of human and GPT-4o annotations on the FRABench dataset is needed.
There is a lack of comparative analysis of the performance of the proposed UFEval with other evaluation methodologies.
Further discussion is needed regarding the comprehensiveness and appropriateness of the 112-aspect taxonomy.
More extensive experiments and analyses are needed to determine the generalization ability of UFEval.