Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models

Created by
  • Haebom

Authors

Meidan Ding, Jipeng Zhang, Wenxuan Wang, Cheng-Yi Li, Wei-Chieh Fang, Hsin-Yu Wu, Haiqin Zhong, Wenting Chen, Linlin Shen

Outline

Med-RewardBench is the first benchmark specifically designed to evaluate reward models and judges for multimodal large language models (MLLMs) in healthcare applications. It features 1,026 expert-annotated multimodal cases spanning 13 organ systems and 8 clinical departments, constructed through a rigorous three-step process to ensure high-quality evaluation data across six clinically important dimensions. Unlike existing benchmarks that focus on general MLLM capabilities or evaluate models as problem solvers, Med-RewardBench targets essential evaluation dimensions such as diagnostic accuracy and clinical relevance. The study evaluates 32 state-of-the-art MLLMs, including open-source, proprietary, and healthcare-specific models, revealing significant challenges in aligning with expert judgment. Furthermore, the authors develop a baseline model that substantially improves performance through fine-tuning.
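The core measurement in a benchmark like this is how often a judge (or reward) model's preference between two candidate responses matches the expert annotation, broken down by clinical dimension. The sketch below is a rough illustration only; the field names, dimension labels, and pairwise-preference format are assumptions, not details taken from the Med-RewardBench release.

```python
# Hypothetical sketch: per-dimension agreement between a judge model's
# preferences and expert annotations. Data layout is assumed, not from the paper.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Case:
    dimension: str        # clinical dimension being judged (e.g., "diagnostic_accuracy")
    expert_choice: str    # "A" or "B": response preferred by the expert annotator
    judge_choice: str     # "A" or "B": response preferred by the judge model

def agreement_by_dimension(cases: list[Case]) -> dict[str, float]:
    """Fraction of cases where the judge's choice matches the expert's, per dimension."""
    hits, totals = defaultdict(int), defaultdict(int)
    for c in cases:
        totals[c.dimension] += 1
        hits[c.dimension] += int(c.judge_choice == c.expert_choice)
    return {d: hits[d] / totals[d] for d in totals}

# Toy usage with illustrative data
cases = [
    Case("diagnostic_accuracy", "A", "A"),
    Case("diagnostic_accuracy", "B", "A"),
    Case("clinical_relevance", "A", "A"),
]
print(agreement_by_dimension(cases))
# {'diagnostic_accuracy': 0.5, 'clinical_relevance': 1.0}
```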

Takeaways, Limitations

Takeaways: Provides the first specialized benchmark for evaluating reward models and judges for MLLMs in the healthcare field. Empirically demonstrates the performance and limitations of various MLLMs. Suggests potential for performance improvement through fine-tuning. Presents evaluation criteria that account for clinical relevance and diagnostic accuracy.
Limitations: The Med-RewardBench dataset may be biased toward specific hospitals or regions. Further research is needed to establish the objectivity and generalizability of the evaluation criteria. The evaluation should be expanded to a wider range of MLLMs. Further validation of long-term clinical utility is needed.