Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation

Created by
  • Haebom

Authors

Enyu Zhao, Vedant Raval, Hejia Zhang, Jiageng Mao, Zeyu Shangguan, Stefanos Nikolaidis, Yue Wang, Daniel Seita

Outline

This paper proposes ManipBench, a novel benchmark for evaluating low-level reasoning in robotic manipulation. While Vision-Language Models (VLMs) are primarily used as high-level planners in robotic manipulation, recent work has also explored their use for low-level reasoning, i.e., determining precise robot actions. ManipBench evaluates the low-level reasoning capabilities of VLMs across various aspects, including object-to-object interaction and manipulation of deformable objects. Thirty-three representative VLMs from ten model families are extensively tested on the benchmark, and the analysis covers performance differences across models as well as the correlation between benchmark scores and real-world manipulation tasks. This analysis reveals a significant gap between current models and human-level understanding.
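
As a rough illustration of how such a benchmark can be scored, the sketch below runs a VLM over image-grounded multiple-choice manipulation questions and computes accuracy. This is a minimal, hypothetical example: the `BenchmarkItem` fields and the `query_vlm` callable are placeholders I introduce for illustration, not ManipBench's actual data format or evaluation API.

```python
# Minimal sketch of a multiple-choice evaluation loop for a VLM benchmark.
# BenchmarkItem and query_vlm are hypothetical placeholders, not the
# authors' actual data format or prompting protocol.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BenchmarkItem:
    image_path: str     # scene image shown to the VLM
    question: str       # e.g., "Which marked point should the gripper grasp?"
    choices: List[str]  # candidate low-level actions, e.g., ["A", "B", "C", "D"]
    answer: str         # ground-truth choice label


def evaluate(items: List[BenchmarkItem],
             query_vlm: Callable[[str, str, List[str]], str]) -> float:
    """Return multiple-choice accuracy of a VLM over the benchmark items."""
    correct = 0
    for item in items:
        prediction = query_vlm(item.image_path, item.question, item.choices)
        correct += int(prediction.strip().upper() == item.answer.upper())
    return correct / len(items) if items else 0.0


if __name__ == "__main__":
    # Dummy "model" that always picks the first choice, just to exercise the loop.
    dummy_vlm = lambda image, question, choices: choices[0]
    items = [BenchmarkItem("scene_0001.png",
                           "Which marked point should the gripper grasp to fold the towel?",
                           ["A", "B", "C", "D"],
                           "C")]
    print(f"Accuracy: {evaluate(items, dummy_vlm):.2%}")
```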

Takeaways, Limitations

Takeaways:
  • Presents ManipBench, a new benchmark that comprehensively evaluates the low-level robotic manipulation reasoning capabilities of VLMs.
  • Compares and analyzes the performance of a wide range of VLMs and reports correlations between benchmark scores and real-world manipulation tasks.
  • Clearly demonstrates the gap between current VLM capabilities and human-level understanding.
Limitations:
  • ManipBench is still an early-stage benchmark; more models and tasks will need to be added in the future.
  • The benchmark design and evaluation metrics may require further review and refinement.
  • Further evaluation is needed on more complex robotic manipulation tasks beyond the scope of the current benchmark.