Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Do AI Models Perform Human-like Abstract Reasoning Across Modalities?

Created by
  • Haebom

Author

Claas Beger, Ryan Yi, Shuhao Fu, Arseny Moskvichev, Sarah W. Tsai, Sivasankaran Rajamanickam, Melanie Mitchell

Outline

Although OpenAI's o3-preview reasoning model surpassed human accuracy on the ARC-AGI benchmark, this work investigates whether state-of-the-art models actually recognize and reason with the abstractions the task creators intended. The authors examine models' abstraction abilities on ConceptARC, evaluating them under settings that vary the input modality (textual vs. visual), whether the model may use an external Python tool, and the amount of reasoning effort a reasoning model expends. In addition to measuring output accuracy, they carefully evaluate the natural-language rules the models generate to explain their solutions. This dual evaluation makes it possible to assess whether a model solves a task using the abstractions ConceptARC was designed to test, rather than relying on surface-level patterns. The results show that while some models using textual representations match human output accuracy, the rules of even the best models frequently rely on surface-level "shortcuts" and capture the intended abstraction far less often than humans do; accuracy alone therefore overestimates general abstract-reasoning ability. In the visual modality, models' output accuracy drops sharply, but rule-level analysis suggests this underestimates their abstraction abilities: a substantial fraction of the generated rules still capture the intended abstraction, yet these rules are often not applied correctly. In short, models still lag behind humans in abstract reasoning, and using accuracy alone on ARC-like tasks can overstate abstract-reasoning ability in textual settings and understate it in visual settings.
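
The dual evaluation described above can be pictured as two independent per-task scores that are aggregated separately. The sketch below is a hypothetical illustration only, assuming a TaskResult record and an evaluate() helper; it is not the authors' pipeline, and the rule-grading step is simply represented as a boolean judgment.

```python
# Hypothetical sketch of the dual evaluation: each task is scored both on
# output-grid accuracy and on whether the model's natural-language rule
# captures the intended abstraction. Field names and the toy data are
# illustrative assumptions, not the paper's actual code or results.
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    output_correct: bool         # does the predicted grid exactly match the target?
    rule_captures_concept: bool  # does the stated rule express the intended abstraction?

def evaluate(results):
    """Aggregate the two metrics separately so they can be compared."""
    n = len(results)
    output_acc = sum(r.output_correct for r in results) / n
    rule_acc = sum(r.rule_captures_concept for r in results) / n
    # A gap between the two scores signals reliance on surface-level shortcuts
    # (high output accuracy, low rule accuracy) or intended abstractions that
    # are stated but misapplied (the reverse pattern).
    return {"output_accuracy": output_acc, "rule_accuracy": rule_acc}

if __name__ == "__main__":
    # Toy example showing the "shortcut" pattern: outputs are mostly right,
    # but few rules capture the intended concept.
    demo = [
        TaskResult("concept-1", True, True),
        TaskResult("concept-2", True, False),
        TaskResult("concept-3", True, False),
        TaskResult("concept-4", False, False),
    ]
    print(evaluate(demo))  # {'output_accuracy': 0.75, 'rule_accuracy': 0.25}
```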

Takeaways, Limitations

  • With textual inputs, models can match human accuracy, but their rules often rely on surface-level patterns, so accuracy alone can overestimate abstract-reasoning ability.
  • With visual inputs, accuracy drops sharply, yet models often still generate rules that capture the intended abstraction, so accuracy alone can underestimate their abstraction ability.
  • Output accuracy by itself is not sufficient to assess abstract reasoning.
  • The study presents a framework for more accurately evaluating the abstract-reasoning ability of multimodal models.
  • For ARC-like tasks, rule-level analysis should accompany accuracy when assessing abstract reasoning.