Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

Created by
  • Haebom

Author

Yifu Yuan, Haiqin Cui, Yaoting Huang, Yibin Chen, Fei Ni, Zibin Dong, Pengyi Li, Yan Zheng, Jianye Hao

Outline

This paper proposes "pointing" as a unified, embodiment-agnostic intermediate representation to address the generalization problem in embodied AI. The authors define four core embodied pointing capabilities that bridge high-dimensional vision-language understanding and low-dimensional action primitives, and introduce Embodied-R1, a 3-billion-parameter vision-language model specialized for embodied reasoning and pointing. They build Embodied-Points-200K, a large-scale dataset of 200,000 examples drawn from diverse sources, and train Embodied-R1 with a two-stage reinforced fine-tuning (RFT) curriculum using a specialized multi-task reward scheme. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks, and without task-specific fine-tuning reaches a 56.2% success rate in SimplerEnv and an 87.5% success rate across eight real-world XArm tasks, a 62% improvement over a strong baseline that demonstrates strong zero-shot generalization. It also exhibits high robustness to diverse visual disturbances. In conclusion, combining pointing-centric representations with the RFT training paradigm offers an effective, generalizable way to bridge the perception-action gap in robotics.
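To make the training setup concrete, a multi-task RFT reward for pointing can be sketched as a weighted combination of a format check and a spatial-accuracy check (did the predicted point land inside the ground-truth region?). This is a minimal, hypothetical illustration; the function names, reward components, and 0.1/0.9 weights are assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a pointing reward in the spirit of Embodied-R1's
# multi-task RFT scheme. Components and weights are illustrative only.

def point_in_region(point, region):
    """Return True if a predicted (x, y) point falls inside a
    ground-truth bounding box given as (x_min, y_min, x_max, y_max)."""
    x, y = point
    x_min, y_min, x_max, y_max = region
    return x_min <= x <= x_max and y_min <= y <= y_max

def pointing_reward(pred_point, gt_region, has_valid_format):
    """Combine a binary format reward (did the model emit a parseable
    point?) with a binary spatial-accuracy reward. The 0.1/0.9 split
    is an assumed weighting for illustration."""
    format_reward = 1.0 if has_valid_format else 0.0
    accuracy_reward = 1.0 if point_in_region(pred_point, gt_region) else 0.0
    return 0.1 * format_reward + 0.9 * accuracy_reward
```

A reward of this shape lets the policy first learn to produce well-formed point outputs, then be driven mainly by whether those points are spatially correct.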

Takeaways, Limitations

Takeaways:
A novel approach that effectively links vision-language understanding and action by using 'pointing' as an intermediate representation.
Development of a powerful model (Embodied-R1) that significantly improves the zero-shot generalization ability of embodied AI.
Implementation of a model that exhibits high robustness in various environments and tasks.
Building a large-scale embodied pointing dataset (Embodied-Points-200K).
Presentation of an effective model training strategy via the two-stage reinforced fine-tuning (RFT) curriculum.
Limitations:
Further validation is needed regarding the size and diversity of the Embodied-Points-200K dataset.
Further testing and validation are needed for real-world applications; current experiments are restricted to a small set of XArm tasks.
Further research is needed on the interpretability and explainability of the model.
Further evaluation of generalizability to other types of embodied AI systems and tasks is needed.