Daily Arxiv

This page curates AI-related papers published around the world.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation

Created by
  • Haebom

Authors

Shaina Raza, Aravind Narayanan, Vahid Reza Khazaie, Ashmal Vayani, Mukund S. Chettiar, Amandeep Singh, Mubarak Shah, Deval Pandya

Outline

Large multimodal models (LMMs) have been extensively tested on tasks such as visual question answering (VQA), image captioning, and grounding, but rigorous evaluation of their alignment with human-centered (HC) values such as fairness, ethics, and inclusivity is still lacking. To address this gap, this paper presents HumaniBench, a novel benchmark consisting of 32,000 real-world image-question pairs together with an evaluation suite. Labels are generated through an AI-assisted pipeline and validated by experts. HumaniBench evaluates LMMs on a variety of open- and closed-ended VQA tasks organized around seven key alignment principles: fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilingualism. These principles, grounded in AI ethics and practical requirements, provide a holistic view of social impact. Benchmark results across a range of LMMs show that proprietary models generally lead in reasoning, fairness, and multilingualism, while open-source models lead in robustness and grounding. Most models struggle to balance accuracy with ethical and inclusive behavior. Techniques such as chain-of-thought prompting and test-time scaling improve alignment. As the first benchmark tailored to HC alignment, HumaniBench provides a rigorous testbed for diagnosing limitations and promoting responsible LMM development. All data and code are publicly available for reproducibility.
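The summary above does not include the authors' code. The sketch below is only a minimal illustration, under assumptions, of what a per-principle evaluation loop with an optional chain-of-thought prefix might look like; the names query_lmm, VQAItem, the dataset fields, and the exact-match metric are all hypothetical stand-ins, not the benchmark's actual data format or scoring.

```python
# Minimal sketch (not the authors' code) of a HumaniBench-style evaluation loop.
# `query_lmm` is a hypothetical wrapper for whatever LMM you test; the dataset
# fields are assumed from the paper's description (real-world image, question,
# expert-validated label, associated alignment principle).

from dataclasses import dataclass

COT_PREFIX = "Think step by step before answering.\n"  # chain-of-thought prompt


@dataclass
class VQAItem:
    image_path: str  # real-world image
    question: str    # open- or closed-ended question
    reference: str   # expert-validated label
    principle: str   # e.g. "fairness", "ethics", "empathy"


def query_lmm(image_path: str, prompt: str) -> str:
    """Hypothetical stub: plug in a proprietary or open-source LMM client here."""
    raise NotImplementedError


def evaluate(items: list[VQAItem], use_cot: bool = False) -> dict[str, float]:
    """Exact-match accuracy per alignment principle (a simplistic stand-in
    for the paper's task-specific metrics)."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for item in items:
        prompt = (COT_PREFIX if use_cot else "") + item.question
        answer = query_lmm(item.image_path, prompt)
        total[item.principle] = total.get(item.principle, 0) + 1
        if answer.strip().lower() == item.reference.strip().lower():
            correct[item.principle] = correct.get(item.principle, 0) + 1
    return {p: correct.get(p, 0) / n for p, n in total.items()}
```

Comparing evaluate(items) with evaluate(items, use_cot=True) would show, for a given model, how much a chain-of-thought prefix shifts per-principle scores, which is the kind of comparison the paper reports.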

Takeaways, Limitations

Takeaways:
Introduces HumaniBench, the first benchmark for rigorously evaluating LMMs' alignment with human-centered values.
Evaluates seven key alignment principles (fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilingualism) through a variety of VQA tasks.
Provides a comparative analysis of the strengths and weaknesses of proprietary and open-source models.
Shows that techniques such as chain-of-thought prompting and test-time scaling help improve LMM alignment.
Ensures reproducibility by publicly releasing all data and code.
Limitations:
Further research is needed on how comprehensively HumaniBench covers ethical and social considerations.
The benchmark may be biased toward certain models or techniques.
The benchmark's scope is limited to VQA tasks and needs to be extended to other multimodal tasks.
The reliability and accuracy of the AI-assisted labeling pipeline require further validation.