Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning

Created by
  • Haebom

Author

Abhigya Verma, Sriram Puttagunta, Seganrasan Subramanian, Sravan Ramachandran

Outline

GRAFT is a structured multimodal benchmark for evaluating models on instruction following, visual reasoning, and visual-text alignment. It features programmatically generated charts and synthetically rendered tables, produced with a Python visualization library, which allows control over data semantics, structure, and clarity. Each GRAFT instance pairs a chart or table image with a systematically generated multi-step analytical question based solely on the visual content. Answers are provided in a structured format such as JSON or YAML, enabling consistent evaluation of both reasoning and output format. The benchmark introduces a taxonomy of reasoning types, including comparison, trend identification, ranking, aggregation, proportion estimation, and anomaly detection, to support comprehensive evaluation. Reference answers follow strict factual and formatting guidelines for accurate, aspect-based assessment. By providing a unified, scalable framework for fine-grained benchmarking of multimodal models on visually grounded structured reasoning tasks, GRAFT sets a new standard for evaluation in this area.
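
As a rough illustration of what one such instance might look like, here is a minimal sketch that renders a synthetic chart and pairs it with a multi-step question and a JSON reference answer. The exact generation pipeline and schema are not described in this summary; matplotlib, the field names, and the values below are assumptions for illustration only.

import json
import matplotlib.pyplot as plt

# Hypothetical GRAFT-style instance: a programmatically generated bar chart
# paired with a multi-step question and a structured (JSON) reference answer.
# The schema and values are illustrative assumptions, not the actual benchmark format.
data = {"Q1": 120, "Q2": 95, "Q3": 140, "Q4": 110}

fig, ax = plt.subplots()
ax.bar(list(data.keys()), list(data.values()))
ax.set_title("Quarterly sales (synthetic)")
ax.set_ylabel("Units sold")
fig.savefig("instance_0001.png")

instance = {
    "image": "instance_0001.png",
    "question": "Rank the quarters by units sold and report the difference "
                "between the highest and lowest quarter.",
    "reasoning_type": ["ranking", "comparison"],
    "answer": {
        "ranking": ["Q3", "Q1", "Q4", "Q2"],
        "difference": 45,
    },
}
with open("instance_0001.json", "w") as f:
    json.dump(instance, f, indent=2)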

Takeaways, Limitations

Takeaways:
  • Provides a new benchmark for accurately assessing how well models answer multi-step analytical questions grounded in visual data (charts, tables).
  • Programmatic generation with a Python visualization library gives control over data semantics, structure, and clarity.
  • Structured response formats (JSON, YAML) enable consistent evaluation of both reasoning and output format (see the scoring sketch after the Limitations list below).
  • A taxonomy of reasoning types (comparison, trend identification, etc.) supports comprehensive evaluation.
  • Strict reference-answer guidelines enable accurate, aspect-based assessment.
  • Sets a new standard for evaluating the visual reasoning ability of multimodal models.
Limitations:
  • Because the benchmark is built on synthetic data, generalization to real-world charts and tables still needs to be verified.
  • The generation pipeline depends on a Python visualization library, making it difficult to extend to other kinds of visuals.
  • Reasoning task types beyond those currently provided still need to be added.
  • Depending on the benchmark's size and complexity, evaluation can require significant computational resources.
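
To make the point about consistent evaluation of structured outputs concrete, here is a minimal scoring sketch: it checks that a model's output parses as JSON with the expected fields and compares each field against the reference answer. The checks, tolerance, and field names are assumptions, not the benchmark's actual scoring code.

import json

def score_answer(model_output: str, reference: dict) -> dict:
    """Score a model's structured answer against a JSON reference (illustrative only)."""
    try:
        predicted = json.loads(model_output)
    except json.JSONDecodeError:
        # Output that does not parse fails the format check outright.
        return {"format_ok": False, "fields_correct": 0, "fields_total": len(reference)}

    correct = 0
    for key, ref_value in reference.items():
        pred_value = predicted.get(key)
        if isinstance(ref_value, (int, float)) and isinstance(pred_value, (int, float)):
            # Numeric fields: allow a tiny tolerance.
            correct += abs(pred_value - ref_value) <= 1e-6
        else:
            # Lists and strings: exact match.
            correct += pred_value == ref_value
    return {"format_ok": True, "fields_correct": correct, "fields_total": len(reference)}

reference = {"ranking": ["Q3", "Q1", "Q4", "Q2"], "difference": 45}
print(score_answer('{"ranking": ["Q3", "Q1", "Q4", "Q2"], "difference": 45}', reference))

A format-plus-field breakdown like this reflects the "aspect-based" evaluation the summary describes: the format check and the per-field correctness can be reported separately.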