Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

Created by
  • Haebom

Authors

Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yugang Jiang, Xipeng Qiu

Outline

AnyGPT is an any-to-any multimodal language model that uses discrete representations to unify speech, text, images, and music. It can be trained stably without modifying the existing large language model (LLM) architecture or training paradigm; new modalities are integrated through data-level preprocessing alone. The authors constructed a text-centric multimodal dataset for multimodal alignment pretraining and, using generative models, synthesized the first large-scale any-to-any multimodal instruction dataset, consisting of 108,000 samples that intricately interweave multiple modalities. Experiments show that AnyGPT can carry out any-to-any multimodal conversations while achieving performance comparable to specialized models across all modalities, demonstrating that discrete representations can effectively and conveniently unify multiple modalities within a language model. A demo is available at https://junzhan2000.github.io/AnyGPT.github.io/ .
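The key idea is that every modality is first mapped to discrete tokens and then folded into a single sequence over an extended vocabulary, so the LLM itself stays unchanged. The sketch below is a minimal illustration of that data-level preprocessing, not the authors' implementation: the tokenizer functions, vocabulary sizes, and ID offsets are hypothetical placeholders standing in for the modality-specific tokenizers (e.g., image, speech, and music codecs) that a system like AnyGPT relies on.

```python
# Minimal sketch (assumed names and sizes, not the authors' code):
# discrete tokens from several modalities are merged into one flat
# token sequence that a standard decoder-only LLM can model.

from typing import List

TEXT_VOCAB_SIZE = 32000       # assumed base LLM vocabulary size
IMAGE_CODEBOOK_SIZE = 8192    # assumed image-tokenizer codebook size
SPEECH_CODEBOOK_SIZE = 1024   # assumed speech-tokenizer codebook size

# Each non-text modality gets its own contiguous ID range appended to the
# text vocabulary, so only the embedding/output tables grow; the model
# architecture and training objective stay the same.
IMAGE_OFFSET = TEXT_VOCAB_SIZE
SPEECH_OFFSET = IMAGE_OFFSET + IMAGE_CODEBOOK_SIZE

def tokenize_text(text: str) -> List[int]:
    # Placeholder: a real system would reuse the LLM's own text tokenizer.
    return [sum(map(ord, tok)) % TEXT_VOCAB_SIZE for tok in text.split()]

def tokenize_image(image_codes: List[int]) -> List[int]:
    # image_codes are discrete indices from an image tokenizer;
    # shift them into the reserved image range of the unified vocabulary.
    return [IMAGE_OFFSET + c for c in image_codes]

def tokenize_speech(speech_codes: List[int]) -> List[int]:
    # Same idea for discrete speech codes.
    return [SPEECH_OFFSET + c for c in speech_codes]

def build_sequence(text: str,
                   image_codes: List[int],
                   speech_codes: List[int]) -> List[int]:
    # Concatenate modalities into one sequence; the LLM is then trained
    # with its usual next-token objective over this sequence.
    return (
        tokenize_text(text)
        + tokenize_image(image_codes)
        + tokenize_speech(speech_codes)
    )

if __name__ == "__main__":
    seq = build_sequence("describe this image", [5, 17, 42], [3, 9])
    print(seq)  # unified token IDs ready for a standard LLM
```

Because only the vocabulary (and hence the embedding and output layers) grows to cover the new token ranges, nothing else about the text-only LLM pipeline needs to change, which is what makes the data-level integration convenient.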

Takeaways, Limitations

Takeaways:
  • Integrates diverse modalities without changing the existing LLM architecture
  • New modalities can be added with data-level preprocessing alone
  • Discrete representations enable effective and convenient multimodal integration
  • Achieves performance comparable to specialized models across all modalities
  • Builds the first large-scale any-to-any multimodal instruction dataset
Limitations:
Limitations are not explicitly discussed in the paper; further research is suggested to improve performance and overcome remaining limitations.