This page curates AI-related papers published worldwide. All content is summarized using Google Gemini, and the page is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
AHELM: A Holistic Evaluation of Audio-Language Models
Created by: Haebom
Authors: Tony Lee, Haoqin Tu, Chi Heem Wong, Zijun Wang, Siwei Yang, Yifan Mai, Yuyin Zhou, Cihang Xie, Percy Liang
Outline
AHELM is a new benchmark for the holistic evaluation of audio-language models (ALMs). To address the shortcomings of existing benchmarks (no standardization, narrow coverage of measurement aspects, and difficulty comparing models), it aggregates diverse datasets, including two new synthetic audio-text datasets, PARADE and CoRe-Bench, and measures ALM performance across ten important aspects: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. Standardized prompts, inference parameters, and evaluation metrics ensure fair comparisons between models. Evaluating 14 open-weight and closed-API ALMs alongside three simple baseline systems, the authors find that Gemini 2.5 Pro ranks first on five of the ten aspects but exhibits group unfairness on ASR tasks. All data are publicly available at https://crfm.stanford.edu/helm/audio/v1.0.0.
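To give a concrete sense of how one of these aspects can be quantified (an illustrative sketch only, not the paper's actual metric or code), group fairness on ASR tasks can be probed by comparing word error rates (WER) across speaker groups; a large gap between groups signals the kind of group unfairness reported above. The sample transcripts and group labels below are hypothetical:

```python
from collections import defaultdict

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / max(len(ref), 1)

# Hypothetical ASR outputs tagged with a speaker-group attribute.
samples = [
    {"group": "A", "ref": "turn on the kitchen light", "hyp": "turn on the kitchen light"},
    {"group": "A", "ref": "set a timer for ten minutes", "hyp": "set a timer for ten minutes"},
    {"group": "B", "ref": "turn on the kitchen light", "hyp": "turn on the chicken light"},
    {"group": "B", "ref": "set a timer for ten minutes", "hyp": "set the timer for tan minutes"},
]

by_group = defaultdict(list)
for s in samples:
    by_group[s["group"]].append(wer(s["ref"], s["hyp"]))

# A large mean-WER gap between groups would indicate group unfairness on ASR.
for group, rates in sorted(by_group.items()):
    print(f"group {group}: mean WER = {sum(rates) / len(rates):.2f}")
```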