AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Created by: Haebom
Authors: Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yugang Jiang, Xipeng Qiu
Outline
AnyGPT is an any-to-any multimodal language model that uses discrete representations to unify various modalities, including speech, text, images, and music. It can be trained stably without modifying the existing large language model (LLM) architecture or training method; new modalities are integrated into the LLM through data-level preprocessing alone. The authors constructed a text-centric multimodal dataset for multimodal alignment pretraining and, using generative models, synthesized the first large-scale any-to-any multimodal instruction dataset, consisting of 108,000 samples that intricately interweave multiple modalities. Experimental results show that AnyGPT can carry out any-to-any multimodal conversations while achieving performance comparable to specialized models across all modalities, demonstrating that discrete representations can effectively and conveniently unify multiple modalities within a language model. A demo is available at https://junzhan2000.github.io/AnyGPT.github.io/ .
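The core mechanism, discrete sequence modeling, can be sketched as follows: each non-text modality is first converted by a modality-specific tokenizer into discrete codes, those codes are shifted into reserved ranges of an expanded LLM vocabulary, and the result is a single flat token sequence that the LLM trains on with its ordinary next-token objective. The snippet below is a minimal illustration of that idea; the vocabulary sizes, offsets, and marker tokens are assumptions made for illustration, not AnyGPT's actual configuration.

```python
# Minimal sketch of discrete sequence modeling for multimodal inputs.
# All vocabulary sizes, ID offsets, and special tokens here are
# illustrative assumptions, not AnyGPT's actual implementation.

TEXT_VOCAB_SIZE = 32000      # base LLM text vocabulary (assumed)
IMAGE_CODEBOOK_SIZE = 8192   # discrete image codes (assumed)
SPEECH_CODEBOOK_SIZE = 1024  # discrete speech codes (assumed)

# Each non-text modality gets its own reserved ID range, so the expanded
# vocabulary is simply: text tokens + image codes + speech codes.
IMAGE_OFFSET = TEXT_VOCAB_SIZE
SPEECH_OFFSET = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE

def encode_image(image_codes: list[int]) -> list[int]:
    """Shift image codebook indices into the LLM's expanded vocabulary."""
    return [IMAGE_OFFSET + c for c in image_codes]

def encode_speech(speech_codes: list[int]) -> list[int]:
    """Shift speech codebook indices into the LLM's expanded vocabulary."""
    return [SPEECH_OFFSET + c for c in speech_codes]

def build_sequence(text_ids: list[int], image_codes: list[int]) -> list[int]:
    """Interleave modalities as one flat token sequence (marker IDs assumed)."""
    BOI, EOI = 2, 3  # hypothetical <image> / </image> marker token IDs
    return text_ids + [BOI] + encode_image(image_codes) + [EOI]

# Example: a text prompt followed by an image, modeled as a single sequence
# that the LLM can train on with its usual next-token prediction objective.
sequence = build_sequence(text_ids=[101, 205, 7], image_codes=[42, 17, 901])
print(sequence)
```

Because the model only ever sees integer token IDs, no architectural change is required; at generation time, IDs that fall in a modality's reserved range would be passed back to that modality's detokenizer (e.g., an image or speech decoder) to reconstruct the output.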