Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All summaries here are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

ChatBench: From Static Benchmarks to Human-AI Evaluation

Created by
  • Haebom

Authors

Serina Chang, Ashton Anderson, Jake M. Hofman

Outline

This paper highlights the growing need to evaluate the joint performance of humans and large language models (LLMs), driven by the rapid proliferation of LLMs, and notes that existing benchmarks such as MMLU measure LLM capabilities only in isolation. The authors therefore designed and ran a user study that converts MMLU questions into user-AI conversations: participants are shown a question and answer it by conversing with an LLM. They release ChatBench, a new dataset spanning 396 questions and two LLMs with AI-only, user-only, and user-AI conditions, comprising 144,000 responses and 7,336 user-AI conversations. The results show that AI-only accuracy does not predict user-AI accuracy, with significant gaps across subjects such as mathematics, physics, and moral reasoning, and an analysis of the conversations sheds light on how interactive use differs from AI-only benchmark runs. Finally, fine-tuning a user simulator on a subset of ChatBench improves estimates of user-AI accuracy, raising the correlation on held-out questions by more than 20% and pointing toward scalable interactive evaluation.
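
To make the evaluation gap concrete, here is a minimal analysis sketch that contrasts per-subject AI-only accuracy with user-AI accuracy and checks how well one predicts the other. It assumes a hypothetical flattened export of ChatBench; the file name and column names below are illustrative, not the released schema.

```python
# Minimal sketch: contrast AI-only accuracy with user-AI accuracy per subject.
# Assumes a hypothetical flattened export of ChatBench ("chatbench_answers.csv")
# with columns: question_id, subject, condition ("ai_only" | "user_only" | "user_ai"),
# and correct (0/1). These names are assumptions, not the released schema.
import pandas as pd

df = pd.read_csv("chatbench_answers.csv")

# Mean accuracy per question under each condition.
acc = (
    df.pivot_table(index=["question_id", "subject"],
                   columns="condition",
                   values="correct",
                   aggfunc="mean")
    .reset_index()
)

# Per-subject gap between AI-only and user-AI accuracy.
by_subject = acc.groupby("subject")[["ai_only", "user_ai"]].mean()
by_subject["gap"] = by_subject["ai_only"] - by_subject["user_ai"]
print(by_subject.sort_values("gap", ascending=False))

# Question-level check: how well does AI-only accuracy predict user-AI accuracy?
print("Pearson r:", acc["ai_only"].corr(acc["user_ai"]))
```

A low question-level correlation here is exactly the paper's point: scoring the model alone tells you little about how well a person does with the model's help.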

Takeaways, Limitations

Takeaways:
• Reveals the limits of existing AI-only evaluation and argues for evaluation methods that account for human-AI collaboration.
• Analyzing user-AI conversations yields new insights into how AI performance should be assessed.
• The public release of the ChatBench dataset enables follow-up research on interactive evaluation.
• Fine-tuning a user simulator can improve estimates of user-AI accuracy (see the simulator sketch at the end of this section).
Limitations:
• ChatBench covers a limited set of question types and only two LLMs, so generalizability still needs to be examined.
• The user simulator's gains may not transfer beyond this dataset; generalization to broader settings and user characteristics remains to be shown.
• The number and diversity of participants in the user study warrant scrutiny.
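
As a rough illustration of how a fine-tuned user simulator can stand in for human participants, the sketch below has a simulator model converse with an assistant model about a question and then commit to a final answer; averaging correctness over questions gives an estimated user-AI accuracy that can be correlated with the real one on held-out questions. The `chat` helper, prompts, and turn limit are assumptions for illustration, not the authors' protocol.

```python
# Sketch of simulator-based conversational evaluation (illustrative only).
# A "user simulator" LLM plays the participant: it sees the question, chats
# with the assistant LLM, and then commits to a final answer.

def chat(model: str, messages: list[dict]) -> str:
    """Hypothetical helper: call your LLM inference API and return the reply text."""
    raise NotImplementedError

def simulate_user_ai_answer(question: str, choices: list[str],
                            user_model: str, assistant_model: str,
                            max_turns: int = 3) -> str:
    # The simulator is prompted as a student solving the question with help.
    user_msgs = [{"role": "system",
                  "content": ("You are a student answering this question with an "
                              "AI assistant's help.\nQuestion: " + question +
                              "\nChoices: " + " | ".join(choices))}]
    assistant_msgs = [{"role": "system", "content": "You are a helpful assistant."}]

    for _ in range(max_turns):
        # Simulated user writes the next message to the assistant.
        user_turn = chat(user_model, user_msgs)
        user_msgs.append({"role": "assistant", "content": user_turn})
        assistant_msgs.append({"role": "user", "content": user_turn})

        # Assistant replies; the reply is shown back to the simulated user.
        assistant_turn = chat(assistant_model, assistant_msgs)
        assistant_msgs.append({"role": "assistant", "content": assistant_turn})
        user_msgs.append({"role": "user", "content": assistant_turn})

    # After the conversation, the simulated user commits to a final choice.
    user_msgs.append({"role": "user",
                      "content": "Reply with only the letter of your final answer."})
    return chat(user_model, user_msgs).strip()
```

The fraction of questions answered correctly this way estimates user-AI accuracy; the paper reports that fine-tuning the simulator on part of ChatBench raises the correlation between such estimates and real user-AI accuracy on held-out questions by more than 20%.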