Daily Arxiv

This page collects artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer

Created by
  • Haebom

Authors

Haotian Ni, Yake Wei, Hang Liu, Gong Chen, Chong Peng, Hao Lin, Di Hu

Outline

This paper addresses multimodal learning, which aims to fuse information from different modalities effectively, particularly when modality quality varies across samples. Dynamic fusion strategies, such as the attention mechanism in Transformers, address this by adaptively emphasizing modalities according to the characteristics of the input. However, extensive experiments show that widely used self-attention models lose this dynamic adaptability and tend to favor certain modalities regardless of the input. This bias triggers a self-reinforcing loop that progressively overemphasizes the favored modality, widens the distribution gap between the attention keys of different modalities, and deactivates the dynamic properties of the attention mechanism. To restore adaptability, the authors propose Rolling Query (RollingQ), a simple yet effective method that balances attention allocation by rotating the query, which breaks the self-reinforcing loop and mitigates the key distribution gap. Extensive experiments across various multimodal scenarios validate the effectiveness of RollingQ and show that restoring cooperation dynamics is important for enhancing the broader capabilities of widely deployed multimodal Transformers. The source code is available at https://github.com/GeWu-Lab/RollingQ_ICML2025 .
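To make the idea of rotating a query more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The toy keys, the function names `modality_attention` and `rotate_toward`, and the interpolation parameter `alpha` are all hypothetical. The sketch measures how much attention a single query gives to each of two modalities, then rotates the query toward the mean key of the neglected modality and measures again.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def modality_attention(query, keys_a, keys_b):
    """Total attention mass that one query assigns to each modality's keys."""
    keys = np.vstack([keys_a, keys_b])
    scores = keys @ query / np.sqrt(query.shape[0])  # scaled dot-product
    weights = softmax(scores)
    n_a = keys_a.shape[0]
    return float(weights[:n_a].sum()), float(weights[n_a:].sum())


def rotate_toward(query, target, alpha):
    """Rotate `query` toward the direction of `target` by a fraction `alpha`
    of the angle between them (spherical interpolation), keeping its norm."""
    norm = np.linalg.norm(query)
    q_hat = query / norm
    t_hat = target / np.linalg.norm(target)
    theta = np.arccos(np.clip(q_hat @ t_hat, -1.0, 1.0))
    if np.sin(theta) < 1e-8:  # parallel or opposite: rotation plane undefined
        return query.copy()
    rotated = (np.sin((1 - alpha) * theta) * q_hat
               + np.sin(alpha * theta) * t_hat) / np.sin(theta)
    return norm * rotated


# Toy setup (hypothetical): modality A keys lie along axis 0, B along axis 1,
# mimicking a large distribution gap between the two sets of keys.
dim = 4
keys_a = np.tile(3.0 * np.eye(dim)[0], (4, 1))
keys_b = np.tile(3.0 * np.eye(dim)[1], (4, 1))
query = 4.0 * np.eye(dim)[0]  # query already aligned with modality A

before = modality_attention(query, keys_a, keys_b)
rolled = rotate_toward(query, keys_b.mean(axis=0), alpha=0.5)
after = modality_attention(rolled, keys_a, keys_b)
print("attention (A, B) before rotation:", before)
print("attention (A, B) after rotation: ", after)
```

In this toy case the query initially places almost all of its attention on modality A. Rotating it halfway toward the mean key of modality B splits the attention evenly between the two. How far to rotate, and when, is a design choice that this sketch leaves open.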

Takeaways, Limitations

Takeaways:
Reveals the limited dynamic adaptability of widely used self-attention models.
Explains the mechanism behind modality bias and its self-reinforcing loop.
Proposes RollingQ, an effective method for mitigating modality bias.
Validates the effectiveness of RollingQ in various multimodal scenarios.
Suggests a new direction for improving the performance of multimodal transformers.
Limitations:
Further research is needed on the generality of RollingQ.
Additional experiments on diverse multimodal datasets and tasks are needed.
Analysis of RollingQ's computational cost and efficiency is needed.