Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright in each paper belongs to its authors and their institutions; when sharing, please cite the source.

MIO: A Foundation Model on Multimodal Tokens

Created by
  • Haebom

Author

Zekun Wang, King Zhu, Chunpu Xu, Wangchunshu Zhou, Jiaheng Liu, Yibo Zhang, Jiashuo Wang, Ning Shi, Siyu Li, Yizhi Li, Haoran Que, Zhaoxiang Zhang, Yuanxing Zhang, Ge Zhang, Ke Xu, Jie Fu, Wenhao Huang

Outline

MIO is a new foundation model that understands and generates speech, text, images, and video in an end-to-end, autoregressive manner over discrete multimodal tokens. It is trained in four stages (alignment pre-training, interleaved pre-training, speech-enhanced pre-training, and comprehensive supervised fine-tuning) and achieves competitive results on a variety of text, visual, and speech tasks, with particular strength in video-to-text generation, visual reasoning, and instructional image editing.
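Conceptually, "multimodal tokens" means every modality is quantized into discrete IDs in one shared vocabulary, so a single autoregressive decoder can both consume and emit any modality. Below is a minimal Python sketch of that idea; the vocabulary sizes, special tokens, and helper names are illustrative assumptions, not MIO's actual implementation.

```python
# A minimal sketch (assumed names/sizes, not MIO's code) of a shared
# vocabulary over discrete tokens from several modalities.

TEXT_VOCAB = 32000    # assumed size of the text tokenizer's vocabulary
IMAGE_CODES = 8192    # assumed size of a VQ image codebook
SPEECH_CODES = 4096   # assumed size of a speech codec's codebook

# Hypothetical special tokens marking modality boundaries in one flat stream.
SPECIALS = {"<boi>": 0, "<eoi>": 1, "<bos>": 2, "<eos>": 3}
BASE = len(SPECIALS)

def text_token(tid: int) -> int:
    """Offset a text-tokenizer ID into the shared vocabulary."""
    return BASE + tid

def image_token(code: int) -> int:
    """Offset an image codebook index into the shared vocabulary."""
    return BASE + TEXT_VOCAB + code

def speech_token(code: int) -> int:
    """Offset a speech codec index into the shared vocabulary."""
    return BASE + TEXT_VOCAB + IMAGE_CODES + code

# One interleaved training sample: text, then an image, then speech.
# Training is plain next-token prediction over this sequence, which is why
# the same model can condition on or generate any of the modalities.
sample = (
    [text_token(t) for t in (10, 42, 7)]        # e.g. "a red bird"
    + [SPECIALS["<boi>"]]
    + [image_token(c) for c in (5, 901, 88)]    # image VQ codes
    + [SPECIALS["<eoi>"]]
    + [speech_token(c) for c in (17, 3)]        # speech codec codes
)
print(sample)
```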

Takeaways, Limitations

Takeaways:
Any-to-any generation: any combination of speech, text, image, and video inputs can produce any combination of those modalities as output (see the decoding sketch after this list).
An open-source model offering capabilities similar to GPT-4o.
Ability to generate multimodal interleaved sequences.
Strong performance across diverse multimodal tasks (video-to-text generation, visual reasoning, instructional image editing, etc.).
Limitations:
The paper does not explicitly discuss its own limitations.
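To make the any-to-any point concrete, here is a toy decoding loop: because all modalities share one token space, generating an image or a spoken reply is the same next-token loop as generating text. The sampling function below is a stand-in assumption; a real model would sample from a transformer's logits.

```python
import random

def sample_next(context: list[int]) -> int:
    """Stand-in for sampling from the language model's logits."""
    return random.randrange(100)

def generate(prompt: list[int], stop_token: int, max_new: int = 32) -> list[int]:
    """Toy any-to-any decoder: whatever modality the prompt tokens encode,
    the continuation is produced by the same next-token prediction loop."""
    out = list(prompt)
    for _ in range(max_new):
        nxt = sample_next(out)
        out.append(nxt)
        if nxt == stop_token:
            break
    return out

# e.g. prompt = text tokens plus "<boi>" to request an image continuation.
print(generate([12, 99, 0], stop_token=1))
```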