This paper addresses improving the performance of music large language models (LLMs) on music understanding tasks, which analyze and interpret diverse musical elements. While previous research has primarily focused on integrating music and text inputs, the potential of incorporating additional modalities, such as video, images, and text-based music features, remains largely unexplored. To address this, we propose DeepResonance, a multimodal music understanding LLM fine-tuned via multi-way instruction tuning on music, text, image, and video data aligned in multiple ways. DeepResonance is trained and evaluated on three four-way datasets, Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T, constructed to integrate visual content and text-based music features. Furthermore, we introduce multi-sampled ImageBind embeddings and a pre-LLM fusion transformer to strengthen modality fusion during multimodal instruction tuning. Experimental results demonstrate that DeepResonance achieves state-of-the-art performance on six music understanding tasks, highlighting the benefits of the auxiliary modalities and its architectural design. We open-source the code, model, and datasets we developed (github.com/sony/DeepResonance).
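
For illustration only, the sketch below shows one way a pre-LLM fusion transformer could fuse multi-sampled ImageBind embeddings into soft prefix tokens for the LLM; it is not the released implementation, and the class name, dimensions, layer counts, and sampling layout are assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): fuse multi-sampled
# ImageBind embeddings with a small transformer encoder, then project them
# into the LLM's embedding space as prefix tokens.
import torch
import torch.nn as nn


class PreLLMFusionTransformer(nn.Module):
    def __init__(self, imagebind_dim=1024, llm_dim=4096, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=imagebind_dim, nhead=num_heads, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Linear projection into the LLM's token-embedding dimension (assumed 4096).
        self.proj = nn.Linear(imagebind_dim, llm_dim)

    def forward(self, modality_embeds: torch.Tensor) -> torch.Tensor:
        # modality_embeds: (batch, num_samples_per_modality * num_modalities, imagebind_dim),
        # i.e. several ImageBind embeddings per input (music / image / video)
        # concatenated along the sequence dimension.
        fused = self.fusion(modality_embeds)  # cross-embedding fusion before the LLM
        return self.proj(fused)               # prefix tokens prepended to text embeddings


# Usage example: 3 modalities x 4 samples each = 12 embeddings per instance.
embeds = torch.randn(2, 12, 1024)
prefix_tokens = PreLLMFusionTransformer()(embeds)  # shape: (2, 12, 4096)
```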