Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

EO-1: Interleaved Vision-Text-Action Pretraining for General Robot Control

Created by
  • Haebom

Authors

Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, Maoqing Yao, Haoran Yang, Jiacheng Bao, Bin Zhao, Dong Wang

Outline

EO-Robotics is a research project targeting general-purpose embodied intelligence with human-level flexible multimodal reasoning and physical interaction. It consists of the EO-1 model and the EO-Data1.5M dataset. EO-1 is built on a unified architecture that processes interleaved multimodal inputs, including images, text, video, and actions, and is trained on EO-Data1.5M, a multimodal embodied reasoning dataset containing over 1.5 million samples. Trained with a combination of autoregressive decoding and flow-matching denoising, the model supports both robot action generation and multimodal embodied reasoning, and it has demonstrated open-world understanding and generalization across a variety of long-horizon, dexterous manipulation tasks.
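To make the combined training objective concrete, here is a minimal, illustrative PyTorch sketch of how a unified backbone might consume interleaved vision, text, and action tokens and be trained with both a next-token cross-entropy loss and a flow-matching (velocity regression) loss on continuous action chunks. All module names, layer sizes, and the `InterleavedVLA` / `training_step` helpers below are assumptions for illustration, not EO-1's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterleavedVLA(nn.Module):
    """Illustrative sketch (not EO-1's real architecture): one backbone over
    interleaved vision/text/action tokens with two output heads, an
    autoregressive language head and a flow-matching action head."""

    def __init__(self, vocab_size=32000, d_model=512, action_dim=7):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.vision_proj = nn.Linear(768, d_model)        # hypothetical ViT feature size
        self.action_proj = nn.Linear(action_dim + 1, d_model)  # noisy action + flow time t
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)     # autoregressive decoding
        self.flow_head = nn.Linear(d_model, action_dim)   # predicts flow velocity

    def forward(self, text_ids, vision_feats, noisy_actions, t):
        # Embed each modality and interleave along the sequence axis.
        # (Causal masking is omitted here for brevity.)
        txt = self.token_emb(text_ids)                    # (B, T_txt, D)
        vis = self.vision_proj(vision_feats)              # (B, T_img, D)
        t_feat = t[:, None, None].expand(-1, noisy_actions.size(1), 1)
        act = self.action_proj(torch.cat([noisy_actions, t_feat], -1))
        h = self.backbone(torch.cat([vis, txt, act], dim=1))
        txt_h = h[:, vis.size(1):vis.size(1) + txt.size(1)]
        act_h = h[:, -act.size(1):]
        return self.lm_head(txt_h), self.flow_head(act_h)


def training_step(model, batch):
    """One combined objective: cross-entropy on text tokens plus a
    flow-matching loss that regresses the velocity of a linear noise-to-data path."""
    a1 = batch["actions"]                      # clean action chunk (B, T_act, A)
    a0 = torch.randn_like(a1)                  # noise sample
    t = torch.rand(a1.size(0))                 # flow time in [0, 1]
    # Linear interpolation path; the target velocity is (a1 - a0).
    a_t = (1 - t)[:, None, None] * a0 + t[:, None, None] * a1
    logits, v_pred = model(batch["text_ids"], batch["vision_feats"], a_t, t)
    lm_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["text_ids"][:, 1:].reshape(-1),
    )
    flow_loss = F.mse_loss(v_pred, a1 - a0)
    return lm_loss + flow_loss
```

The point the sketch tries to capture is that discrete text tokens and continuous action chunks share one backbone: the autoregressive head supervises multimodal reasoning while the flow-matching head supervises control, and the two losses are simply summed. How EO-1 actually interleaves modalities, masks attention, and weights the losses is not specified here.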

Takeaways, Limitations

Takeaways:
Handles multimodal input effectively through a unified architecture.
Improves model performance by leveraging the large-scale, high-quality EO-Data1.5M dataset.
Combines autoregressive decoding and flow-matching denoising in a single training methodology.
Demonstrates generalization across a variety of robotic tasks.
Limitations:
The paper does not explicitly discuss its limitations.