Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

AHELM: A Holistic Evaluation of Audio-Language Models

Created by
  • Haebom

Authors

Tony Lee, Haoqin Tu, Chi Heem Wong, Zijun Wang, Siwei Yang, Yifan Mai, Yuyin Zhou, Cihang Xie, Percy Liang

Outline

AHELM is a new benchmark for the holistic evaluation of audio-language models (ALMs). To address the shortcomings of existing benchmarks (lack of standardization, narrow measurement coverage, and difficulty comparing models), it aggregates diverse datasets, including two new synthetic audio-text datasets, PARADE and CoRe-Bench. It measures ALM performance across ten important aspects: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. Standardized prompts, inference parameters, and evaluation metrics ensure fair comparisons between models. Evaluating 14 open-weight and closed-API ALMs along with three simple baseline systems, the authors find that Gemini 2.5 Pro ranks highest on five aspects but exhibits group unfairness on ASR tasks. All data are publicly available at https://crfm.stanford.edu/helm/audio/v1.0.0 .
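
To make the idea of aspect-wise scoring and aggregation concrete, here is a minimal sketch of a HELM-style mean win rate: each model gets a per-aspect score, and its overall rank is driven by the fraction of pairwise comparisons it wins, averaged over aspects. The model names and numbers below are hypothetical illustrations, not values from the AHELM leaderboard, and this is not the AHELM evaluation code.

```python
from statistics import mean

# Hypothetical per-aspect scores for a few ALMs (illustrative numbers only).
scores = {
    "model_a": {"audio_perception": 0.82, "reasoning": 0.74, "safety": 0.91},
    "model_b": {"audio_perception": 0.79, "reasoning": 0.80, "safety": 0.88},
    "model_c": {"audio_perception": 0.65, "reasoning": 0.60, "safety": 0.70},
}

def mean_win_rate(model, all_scores):
    """Fraction of head-to-head comparisons the model wins, averaged over aspects."""
    others = [m for m in all_scores if m != model]
    per_aspect_rates = []
    for aspect in all_scores[model]:
        wins = sum(all_scores[model][aspect] > all_scores[o][aspect] for o in others)
        per_aspect_rates.append(wins / len(others))
    return mean(per_aspect_rates)

for model in scores:
    print(model, round(mean_win_rate(model, scores), 2))
```

Aggregating this way rewards models that are consistently strong across all aspects rather than excellent on one and weak elsewhere, which is the point of a holistic benchmark.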

Takeaways, Limitations

Takeaways:
AHELM provides a standardized benchmark for ALM evaluation, enabling fair comparison between models.
It measures overall ALM performance by comprehensively evaluating diverse aspects (audio perception, reasoning, bias, safety, etc.).
Comparing existing models against simple baseline systems suggests directions for ALM development.
AHELM is planned to be continuously updated with new datasets and models.
Limitations:
The number of models currently included in the benchmark may be limited.
Additional validation is needed for the scale and generalizability of the new datasets (PARADE and CoRe-Bench).
Further analysis is needed to interpret the evaluation results for specific aspects.