Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning

Created by
  • Haebom

Author

Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Hiromi Wakaki, Yuki Mitsufuji

Outline

This paper addresses improving large language models (LLMs) for music understanding, i.e., analyzing and interpreting diverse musical elements. While prior work has focused primarily on integrating music and text inputs, the potential of additional modalities such as images, videos, and text-based music features remains largely unexplored. To address this, the authors propose DeepResonance, a multimodal music understanding LLM fine-tuned via multi-way instruction tuning on aligned music, text, image, and video data. They construct three four-way training and evaluation datasets, Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T, which enable DeepResonance to integrate visual content and text-based music features. They further introduce multi-sampled ImageBind embeddings and a pre-LLM fusion Transformer to strengthen modality fusion before the text LLM, tailoring the model to multi-way instruction tuning. Experiments show that DeepResonance achieves state-of-the-art performance on six music understanding tasks, demonstrating the benefit of the auxiliary modalities and the structural advantages of the model. The code, models, and newly constructed datasets are open-sourced (github.com/sony/DeepResonance).
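The core architectural idea, sampling multiple ImageBind embeddings per input and fusing them with a small Transformer before they reach the text LLM, can be sketched roughly as follows. This is a minimal PyTorch illustration: the dimensions, layer counts, modality markers, and fusion details are assumptions made for exposition, not the released DeepResonance implementation (see the repository for the actual code).

```python
# Minimal sketch of multi-sampled modality embeddings fused by a small
# Transformer before the LLM. All hyperparameters are illustrative.
import torch
import torch.nn as nn

class PreLLMFusion(nn.Module):
    def __init__(self, embed_dim=1024, llm_dim=4096, num_layers=2, num_heads=8):
        super().__init__()
        # Learnable markers distinguishing the modalities (music / image / video).
        self.modality_embed = nn.Embedding(3, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Project fused features into the LLM's token-embedding space.
        self.to_llm = nn.Linear(embed_dim, llm_dim)

    def forward(self, music_emb, image_emb, video_emb):
        # Each input: (batch, num_samples, embed_dim) — several ImageBind
        # embeddings sampled per input rather than one pooled vector.
        parts = []
        for idx, emb in enumerate((music_emb, image_emb, video_emb)):
            parts.append(emb + self.modality_embed.weight[idx])
        tokens = torch.cat(parts, dim=1)   # (batch, total_samples, embed_dim)
        fused = self.fusion(tokens)        # cross-modal self-attention
        return self.to_llm(fused)          # prefix tokens fed to the LLM

# Toy usage: 4 music samples, 2 image samples, 6 video-frame samples.
fusion = PreLLMFusion()
music = torch.randn(1, 4, 1024)
image = torch.randn(1, 2, 1024)
video = torch.randn(1, 6, 1024)
prefix = fusion(music, image, video)       # (1, 12, 4096)
```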

Takeaways, Limitations

Takeaways:
Integrating multimodal information (music, text, images, and video) improves music understanding LLM performance.
An effective modality fusion strategy is presented, combining multi-way instruction tuning with a pre-LLM fusion Transformer over multi-sampled ImageBind embeddings.
New four-way multimodal music datasets (Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T) are released; see the record sketch after this list.
State-of-the-art performance is achieved on six music understanding tasks.
Open-sourcing the code, models, and datasets improves reproducibility and supports follow-up research.
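To make the "four-way" alignment concrete, a single instruction-tuning record could look like the following. The field names and values are invented for illustration only; the released Music4way datasets define their own schema.

```python
# Hypothetical four-way record: music, image, video, and text-based music
# features aligned with an instruction/response pair. Illustrative only.
example = {
    "instruction": "Describe the music, taking the paired image and video into account.",
    "music": "clips/sample_0001.wav",      # audio input
    "image": "images/sample_0001.jpg",     # paired visual scene
    "video": "videos/sample_0001.mp4",     # paired video clip
    "music_features": {                    # text-based music features
        "tempo_bpm": 92,
        "key": "A minor",
        "instruments": ["piano", "strings"],
    },
    "response": "A slow, melancholic piano piece with sweeping strings ...",
}
```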
Limitations:
The size and diversity of the constructed datasets require further examination.
Generalization across diverse music genres and styles needs to be evaluated.
Stronger comparative analysis against other multimodal LLMs is needed.
The interpretability and explainability of the model remain to be studied.