Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

Created by
  • Haebom

Author

Yurun Chen, Xavier Hu, Yuhan Liu, Ziqi Wang, Zeyi Liao, Lin Chen, Feng Wei, Yuxi Qian, Bo Zheng, Keting Yin, Shengyu Zhang

Outline

As multimodal LLM-based agents grow more autonomous and general, evaluations built on static datasets fail to capture their actual capabilities in dynamic environments and across diverse tasks. To address this, the authors propose Graph2Eval, a framework that comprehensively evaluates agents' reasoning, collaboration, and interaction capabilities by automatically generating multimodal document-understanding and web-interaction tasks from a knowledge graph. Using a knowledge graph constructed from external data as a workspace, it transforms semantic relationships into structured multimodal tasks through subgraph sampling, task templates, and meta-paths. A multi-stage filtering pipeline based on node reachability, LLM scoring, and similarity analysis ensures the quality and feasibility of the generated tasks. Graph2Eval supports end-to-end evaluation of various agent types, including single agents, multi-agent systems, and web agents, measuring their reasoning, collaboration, and interaction capabilities. Experiments on Graph2Eval-Bench, a curated dataset of 1,319 document-understanding and web-interaction scenarios, show that the framework differentiates agent and model performance, reveals gaps in reasoning, collaboration, and web interaction across settings, and offers a new perspective on agent evaluation.
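The generation pipeline described above (sample a subgraph from the knowledge graph, instantiate a task template over it, then filter by reachability) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the graph representation, function names (`sample_subgraph`, `generate_tasks`), and the single reachability filter standing in for the full multi-stage pipeline (LLM scoring and similarity analysis are omitted) are all assumptions for illustration.

```python
# Hypothetical sketch of Graph2Eval-style task generation.
# The knowledge graph is a plain adjacency list: node -> list of neighbors.

def sample_subgraph(graph, start, depth=2):
    """Breadth-first sample: all nodes reachable from `start` within `depth` hops."""
    frontier, seen = [start], {start}
    for _ in range(depth):
        nxt = []
        for node in frontier:
            for neigh in graph.get(node, []):
                if neigh not in seen:
                    seen.add(neigh)
                    nxt.append(neigh)
        frontier = nxt
    return seen

def generate_tasks(graph, template, min_nodes=2):
    """Instantiate one task per sampled subgraph, dropping isolated nodes.

    The `min_nodes` check is a stand-in for the paper's node-reachability
    filter; the real pipeline adds LLM scoring and similarity analysis.
    """
    tasks = []
    for start in graph:
        nodes = sample_subgraph(graph, start)
        if len(nodes) >= min_nodes:  # reachability filter
            tasks.append(template.format(entities=", ".join(sorted(nodes))))
    return tasks

if __name__ == "__main__":
    kg = {"LLM": ["agent"], "agent": ["web task"], "web task": []}
    for task in generate_tasks(kg, "How are these entities related: {entities}?"):
        print(task)
```

The isolated "web task" node yields a single-node subgraph and is filtered out, mirroring how infeasible tasks are pruned before evaluation.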

Takeaways, Limitations

Takeaways:
  • Presents a novel framework for evaluating agent capabilities in dynamic environments and across diverse tasks.
  • Automatically generates multimodal tasks from knowledge graphs.
  • Comprehensively assesses reasoning, collaboration, and web-interaction skills.
  • Supports end-to-end evaluation of various agent types.
  • Validates performance through experiments on Graph2Eval-Bench.
  • Offers a new perspective on agent evaluation.
Limitations:
  • Task generation and evaluation rely on LLMs, so results may depend on the performance of the underlying LLM.
  • Building and maintaining the knowledge graph adds complexity.
  • The filtering pipeline needed to ensure the quality of generated tasks is itself complex.
  • Web-interaction tasks require adaptability to changes in the web environment.