Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

PyVision: Agentic Vision with Dynamic Tooling

Created by
  • Haebom

Authors

Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, Chen Wei

Outline

This paper presents PyVision, an interactive, multi-turn framework that addresses the limitations of large language models (LLMs) in visual reasoning. PyVision lets a model autonomously generate, execute, and refine Python-based tools tailored to the task at hand, which makes problem solving more flexible and interpretable. The authors develop a taxonomy of the tools PyVision creates and analyze how they are used across a range of benchmarks. Experiments show consistent gains: PyVision improves GPT-4.1 by 7.8% on V* and Claude-4.0-Sonnet by 31.1% on VLMsAreBlind-mini. These results suggest that dynamic tooling lets models not only use tools but also invent them, moving toward more autonomous visual reasoning.
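The generate, execute, and refine cycle described above can be pictured with a minimal sketch. This is not PyVision's actual implementation: the `agent_loop` and `run_snippet` helpers, the `ANSWER:` and `OUTPUT:` message conventions, and the scripted stand-in for the model are all assumptions made for illustration. A real system would call an LLM, pass images, and run the code in an isolated environment.

```python
import contextlib
import io
import traceback
from typing import Callable, Dict, List, Optional


def run_snippet(code: str, namespace: Dict[str, object]) -> str:
    """Run one model-written snippet and return its stdout, plus a traceback on failure."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # assumption: no sandboxing in this sketch
    except Exception:
        return buffer.getvalue() + traceback.format_exc()
    return buffer.getvalue()


def agent_loop(model: Callable[[List[str]], str], question: str,
               max_turns: int = 5) -> Optional[str]:
    """Alternate between model replies and code execution until an answer appears."""
    history = [question]
    namespace: Dict[str, object] = {}  # variables persist across turns
    for _ in range(max_turns):
        reply = model(history)
        history.append(reply)
        if reply.startswith("ANSWER:"):  # hypothetical stop convention
            return reply[len("ANSWER:"):].strip()
        # Feed the execution result (or error) back so the model can refine its tool.
        history.append("OUTPUT:\n" + run_snippet(reply, namespace))
    return None


def scripted_model(history: List[str]) -> str:
    """Stand-in for an LLM: writes a buggy tool, fixes it, then answers."""
    turn = len(history) // 2
    if turn == 0:
        # Bug: sum() over a list of lists raises TypeError.
        return "pixels = [[0, 255], [255, 255]]\nprint(sum(pixels))"
    if turn == 1:
        return "print(sum(v == 255 for row in pixels for v in row))"
    return "ANSWER: " + history[-1].splitlines()[-1]


print(agent_loop(scripted_model, "How many pixels have value 255?"))
```

In this toy run the first snippet fails, the traceback is returned to the model as an observation, and the second snippet reuses the `pixels` variable kept in the shared namespace. That persistence across turns is a design choice of the sketch, not a confirmed detail of the paper.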

Takeaways, Limitations

Takeaways:
Improved LLM-based visual reasoning: PyVision improved the visual reasoning performance of GPT-4.1 and Claude-4.0-Sonnet.
Dynamic tool creation and use: The work presents a paradigm in which the LLM creates and uses tools as needed.
Flexible and interpretable problem solving: PyVision enables more flexible and interpretable visual reasoning.
Potential for self-directed visual reasoning: The results suggest that LLMs can evolve beyond simply using tools into more self-directed systems that create and apply tools to solve problems.
Limitations:
Further research is needed on PyVision's generalization and its applicability to a wider range of visual reasoning problems.
Scalability may be limited by the dependence on Python-based tools.
The safety and reliability of the generated tools need to be verified.
The results are reported for specific models (GPT-4.1, Claude-4.0-Sonnet); further research is needed to determine whether they generalize to other models.
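As one illustration of what verifying generated tools could involve, the sketch below statically checks a model-written snippet for imports outside an allowlist before it is run. Nothing here comes from the paper: the `ALLOWED_MODULES` set and the `disallowed_imports` helper are hypothetical, and a check like this is only a first filter, not a sandbox (for example, it does not catch calls to `__import__`).

```python
import ast
from typing import List

# Hypothetical allowlist; a real deployment would choose its own.
ALLOWED_MODULES = {"math", "numpy", "PIL"}


def disallowed_imports(code: str) -> List[str]:
    """Return the imported module names whose top-level package is not allowlisted."""
    flagged = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # relative imports have no module name
        else:
            continue
        for name in names:
            if name.split(".")[0] not in ALLOWED_MODULES:
                flagged.append(name)
    return flagged


print(disallowed_imports("import os\nfrom PIL import Image"))
```

A snippet that returns a non-empty list would be rejected or sent back to the model for revision before execution.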