Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

Created by
  • Haebom

Author

Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang

Outline

In this paper, we propose TreeBench, a new benchmark for comprehensively evaluating visual grounded reasoning models. TreeBench is built on three principles: focused visual perception of subtle objects in complex scenes, traceable evidence via bounding-box evaluation, and second-order reasoning that tests object interactions and spatial hierarchies beyond simple object localization. It consists of 405 challenging visual question-answer pairs, manually annotated by experts from a sample of 1,000 high-quality images drawn from the SA-1B dataset. Even existing state-of-the-art models fall short of 60% accuracy; OpenAI-o3, for instance, scores 54.87%. In addition, we present TreeVGR, a training paradigm that jointly supervises localization and reasoning via reinforcement learning, enabling accurate localization and explainable reasoning paths. TreeVGR shows significant performance improvements on V*Bench, MME-RealWorld, and TreeBench.
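The joint supervision idea can be illustrated with a toy reward function: the model is rewarded both for answering correctly and for producing evidence boxes that overlap ground truth. This is a minimal sketch, not the authors' implementation; the function names, weighting, and reward shape are illustrative assumptions.

```python
# Hedged sketch: a scalar RL reward combining answer correctness with
# bounding-box IoU, in the spirit of jointly supervising localization
# and reasoning. All names and the weight w_loc are illustrative.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def reward(answer_correct, pred_boxes, gt_boxes, w_loc=0.5):
    """Accuracy term plus a weighted mean-IoU localization term."""
    acc = 1.0 if answer_correct else 0.0
    loc = sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / max(len(gt_boxes), 1)
    return acc + w_loc * loc
```

Under such a reward, a policy cannot score highly by guessing the answer alone; it must also ground its reasoning in boxes that match the annotated evidence, which is what makes the reasoning path traceable.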

Takeaways, Limitations

Takeaways:
  • TreeBench provides a new benchmark that comprehensively evaluates visual grounded reasoning models.
  • Emphasizing traceable evidence helps in understanding and improving a model's reasoning process.
  • The TreeVGR training paradigm demonstrates a path to improving the performance of visual grounded reasoning models.
  • Offers a novel approach to complex visual question-answering problems.
Limitations:
  • The TreeBench dataset is relatively small (405 question-answer pairs).
  • Construction relies heavily on manual expert annotation.
  • TreeVGR's performance gains may be limited to the evaluated datasets.
  • Generalization to broader and more diverse visual question-answering problems still needs verification.