Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Created by
  • Haebom

Author

Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu,

Outline

This paper argues that existing spatial reasoning approaches overlook object orientation, a crucial factor in fine-grained 6-DOF manipulation. Existing pose representations rely on predefined reference frames or templates, which limits both generalization and semantic grounding. To address this, the authors propose "semantic orientation," which describes an object's orientation in natural language without a reference frame (e.g., the "plug-in" direction of a USB, the "handle" direction of a cup). They build OrienText300K, a large-scale dataset of 3D objects annotated with semantic orientations, and develop PointSO, a general model for zero-shot semantic orientation prediction. They then present SoFar, a framework that integrates semantic orientation into a VLM agent to enable 6-DOF spatial reasoning and generate robot motions. In experiments, SoFar achieves zero-shot success rates of 48.7% on Open6DOR and 74.9% on SIMPLER-Env, which the authors present as evidence of its effectiveness and generalization.
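To make the idea of a semantic orientation more concrete, the sketch below treats it as a unit direction vector tied to a language phrase and computes the rotation that would align that direction with a target direction. This is only an illustration under stated assumptions, not the SoFar or PointSO implementation: the dictionary `SEMANTIC_ORIENTATIONS`, its hard-coded vectors, and the helper names are hypothetical, and the alignment uses the standard Rodrigues rotation formula.

```python
import math

# Hypothetical example data: phrase -> direction in some observation frame.
# In the paper such directions are predicted by a model; here they are made up.
SEMANTIC_ORIENTATIONS = {
    "plug-in": (1.0, 0.0, 0.0),
    "handle": (0.0, 1.0, 0.0),
}

def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector has no direction")
    return tuple(c / n for c in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rotation_aligning(src, dst):
    """Return a 3x3 rotation matrix (nested lists) that rotates src onto dst."""
    a, b = normalize(src), normalize(dst)
    v = cross(a, b)
    c = dot(a, b)
    s2 = dot(v, v)  # squared sine of the angle between a and b
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    if s2 < 1e-12:
        if c > 0.0:
            return eye  # already aligned
        # Opposite directions: rotate 180 degrees about an axis perpendicular to a.
        helper = (1.0, 0.0, 0.0) if abs(a[0]) < 0.9 else (0.0, 1.0, 0.0)
        k = normalize(cross(a, helper))
        return [[2.0 * k[i] * k[j] - eye[i][j] for j in range(3)] for i in range(3)]
    # Rodrigues formula: R = I + K + K^2 * (1 - c) / s^2, with K the skew matrix of v.
    K = [[0.0, -v[2], v[1]],
         [v[2], 0.0, -v[0]],
         [-v[1], v[0], 0.0]]
    K2 = [[sum(K[i][m] * K[m][j] for m in range(3)) for j in range(3)] for i in range(3)]
    f = (1.0 - c) / s2
    return [[eye[i][j] + K[i][j] + f * K2[i][j] for j in range(3)] for i in range(3)]

def apply_rotation(R, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))

# Usage: rotate the object so its "plug-in" direction points straight down.
target_down = (0.0, 0.0, -1.0)
R = rotation_aligning(SEMANTIC_ORIENTATIONS["plug-in"], target_down)
aligned = apply_rotation(R, SEMANTIC_ORIENTATIONS["plug-in"])
```

Note that aligning a single direction leaves the rotation about that direction unconstrained, so a full 6-DOF pose would need additional constraints (for example, a second semantic direction plus a target position).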

Takeaways, Limitations

Takeaways:
Semantic orientation, a natural-language representation of object orientation that needs no reference frame, improves the accuracy of fine-grained 6-DOF manipulation.
The paper provides OrienText300K, a large-scale dataset of 3D objects annotated with semantic orientations.
The paper develops PointSO, a zero-shot semantic orientation prediction model, and SoFar, a 6-DOF spatial reasoning framework.
SoFar achieves zero-shot success rates of 48.7% on Open6DOR and 74.9% on SIMPLER-Env.
Limitations:
Further validation of the versatility and diversity of the OrienText300K dataset is needed.
PointSO's performance may be biased toward certain types of objects or orientations.
Further research is needed on the application and stability of the SoFar framework to real-world robotic systems.
There is a need to evaluate generalization performance for complex objects or multi-object interactions.