Daily Arxiv

This page curates AI-related papers published around the world.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Substance over Style: Evaluating Proactive Conversational Coaching Agents

Created by
  • Haebom

Authors

Vidya Srinivas, Xuhai Xu, Xin Liu, Kumar Ayush, Isaac Galatzer-Levy, Shwetak Patel, Daniel McDuff, Tim Althoff

Outline

Unlike existing NLP research centered on single-turn responses, this paper focuses on coaching settings where the goal is initially unclear, evolves over multiple turns of interaction, and is judged against subjective criteria. The authors design and implement a multi-turn coaching agent with five distinct conversational styles and evaluate it by collecting first-impression feedback in a user study spanning 155 conversations. They find significant differences between user feedback and the evaluations given by experts and language models, yielding insights for the design and evaluation of conversational coaching agents and contributing to more human-centered NLP applications.
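The core finding, that user impressions can diverge from expert and language-model judgments, comes down to comparing ratings of the same conversations across rater types. The sketch below is not from the paper; it only illustrates one common way to quantify such divergence (rank correlation) using made-up ratings and hypothetical variable names.

```python
# Illustrative sketch only (not the paper's method): comparing how three
# rating sources -- end users, expert coaches, and an LLM judge -- rank the
# same coaching conversations. All ratings below are fabricated.
from scipy.stats import spearmanr

# Hypothetical 1-5 quality ratings for the same ten conversations.
user_ratings   = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
expert_ratings = [3, 4, 2, 3, 4, 2, 5, 2, 3, 4]
llm_ratings    = [4, 4, 3, 3, 4, 3, 5, 3, 4, 4]

# Rank correlation indicates how closely each pair of raters agrees on
# the ordering of conversations; low values signal the kind of gap the
# paper reports between user feedback and other evaluations.
for name, ratings in [("expert", expert_ratings), ("LLM judge", llm_ratings)]:
    rho, p = spearmanr(user_ratings, ratings)
    print(f"user vs {name}: Spearman rho={rho:.2f} (p={p:.3f})")
```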

Takeaways, Limitations

Takeaways:
Offers a new perspective on designing and evaluating conversational agents in multi-turn coaching settings.
Highlights the gap between user feedback and expert evaluation, underscoring the importance of developing human-centered NLP systems.
Empirically demonstrates the importance of a coaching agent's core functionality and stylistic elements.
Suggests directions for system improvement by comparing evaluation methodologies (user feedback, expert evaluation, and language-model evaluation).
Limitations:
The user study (155 conversations) may be relatively small in scale.
Generalizability may be limited, as the results are confined to a specific coaching domain.
The subjectivity of user feedback may reduce the reliability of the evaluation.
Further research is needed to ensure objectivity in expert and language-model evaluations.