Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control

Created by
  • Haebom

Author

Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, Maoqing Yao, Haoran Yang, Jiacheng Bao, Bin Zhao, Dong Wang

Outline

This paper introduces EO-Robotics, comprising EO-1, a unified embodied foundation model, and EO-Data1.5M, a large-scale multimodal embodied reasoning dataset with over 1.5 million samples. EO-1 uses a single unified architecture that processes interleaved inputs across modalities, including images, text, video, and actions, and is trained on EO-Data1.5M by synergistically combining autoregressive decoding with flow-matching denoising. This design enables seamless robot action generation alongside multimodal embodied reasoning, and the model demonstrates strong open-world understanding and generalization across a variety of long-horizon, dexterous manipulation tasks. The paper details EO-1's architecture, the data curation strategy behind EO-Data1.5M, and the training methodology.
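The combination of autoregressive decoding (for discrete text tokens) and flow-matching denoising (for continuous actions) can be sketched as a joint training objective over two heads of one backbone. This is a minimal illustrative sketch, not the paper's implementation; the function names, the rectified-flow interpolation, and the equal loss weighting are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def autoregressive_loss(logits, targets):
    """Next-token cross-entropy over discrete (text) tokens.
    logits: (T, V) unnormalized scores; targets: (T,) token ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def flow_matching_loss(pred_velocity, action, noise):
    """Flow-matching regression target: the velocity (action - noise)
    along the straight path from noise to the ground-truth action."""
    target_velocity = action - noise
    return np.mean((pred_velocity - target_velocity) ** 2)

# Toy batch: 4 text tokens over a vocabulary of 10, plus a 6-dim action.
logits = rng.normal(size=(4, 10))        # stand-in for the text head's output
targets = rng.integers(0, 10, size=4)
action = rng.normal(size=6)              # ground-truth continuous action
noise = rng.normal(size=6)
t = rng.uniform()
x_t = (1 - t) * noise + t * action       # interpolated point fed to the model
pred_velocity = rng.normal(size=6)       # stand-in for the action head's output

# Joint objective: sum of both losses (1:1 weighting is an assumption).
loss = autoregressive_loss(logits, targets) + flow_matching_loss(
    pred_velocity, action, noise)
```

In a real model, `logits` and `pred_velocity` would both come from a shared transformer backbone over the interleaved vision-text-action sequence, so gradients from both objectives update the same weights.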

Takeaways, Limitations

Takeaways:
Presents EO-1, a unified embodied foundation model that achieves strong performance in both multimodal embodied reasoning and robot control.
Demonstrates the effectiveness of a unified architecture that seamlessly handles inputs across diverse modalities.
Releases EO-Data1.5M, a large-scale multimodal embodied reasoning dataset with over 1.5 million high-quality samples.
Presents an effective training recipe that combines autoregressive decoding with flow-matching denoising.
Shows improved open-world understanding and generalization on long-horizon, dexterous manipulation tasks.
Limitations:
No clear comparative analysis of whether EO-1's capabilities approach human-level flexibility.
Further analysis of bias and coverage in the EO-Data1.5M dataset is needed.
Additional experiments are needed to evaluate EO-1's generalization across diverse robot platforms and environments.
No evaluation of energy efficiency or real-time performance.