Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents

Created by
  • Haebom

Author

Tianjian Liu, Fanqi Wan, Jiajian Guo, Xiaojun Quan

Outline

This paper proposes ProactiveEval, a unified framework for evaluating the proactive dialogue capabilities of large language models (LLMs). To address the limitations of previous studies, which focused on specific domains or task-oriented scenarios and thus prevented a comprehensive assessment of models' proactive dialogue abilities, the authors decompose proactive dialogue into two aspects: target planning and dialogue guidance. They establish evaluation metrics across multiple domains and design the framework to automatically generate diverse and challenging evaluation data. They develop 328 evaluation environments spanning six distinct domains and experiment with 22 LLMs, finding that DeepSeek-R1 and Claude-3.7-Sonnet perform best on the target planning and dialogue guidance tasks, respectively. Finally, they investigate how reasoning capability affects proactive behavior and discuss implications for future model development.

Takeaways, Limitations

Takeaways:
Presents an integrated, systematic framework (ProactiveEval) for evaluating the proactive dialogue capabilities of LLMs.
Extensive experiments across diverse domains and LLMs identify models (DeepSeek-R1, Claude-3.7-Sonnet) that excel at target planning and dialogue guidance, respectively.
Clarifies the relationship between reasoning capability and proactive dialogue ability, and suggests directions for future model development.
Limitations:
Further research is needed to establish the generalizability of the ProactiveEval framework.
The diversity and balance of the automatically generated evaluation data warrant further review.
Results that may be biased toward specific domains should be interpreted with caution.
The definition and measurement of proactive dialogue capability merit further discussion.