The rapid proliferation of large language models (LLMs) has created a growing need to evaluate what humans and LLMs can achieve together. However, existing benchmarks, such as MMLU, measure only LLMs' capabilities in isolation. We therefore designed and conducted a user study that transforms MMLU questions into user-AI conversations: each user is given a question and answers it through a conversation with an LLM. We release ChatBench, a new dataset containing AI-only, user-only, and user-AI data for 396 questions and two LLMs, comprising 144,000 responses and 7,336 user-AI conversations. We find that AI-only accuracy does not predict user-AI accuracy, with significant differences across subjects such as mathematics, physics, and moral reasoning. By analyzing the user-AI conversations, we provide insight into how they diverge from AI-only benchmarks. Finally, fine-tuning a user simulator on a subset of ChatBench improves its ability to estimate user-AI accuracy, increasing the correlation on held-out questions by more than 20%, suggesting the potential for scalable conversational evaluation.