Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

MVCL-DAF++: Enhancing Multimodal Intent Recognition via Prototype-Aware Contrastive Alignment and Coarse-to-Fine Dynamic Attention Fusion

Created by
  • Haebom

Authors

Haofeng Huang, Yifei Han, Long Zhang, Bin Li, Yangfan He

Outline

MVCL-DAF++ is a model proposed to address two weaknesses of multimodal intent recognition (MMIR): weak semantic grounding and low robustness under noisy or rare-class conditions. It extends the existing MVCL-DAF with two modules. First, prototype-aware contrastive alignment aligns each instance with class-level prototypes to improve semantic consistency. Second, coarse-to-fine attention fusion combines global modality summaries with token-level features to perform hierarchical cross-modal interaction. On the MIntRec and MIntRec2.0 datasets, MVCL-DAF++ achieves state-of-the-art performance, improving weighted F1 (WF1) on rare-class recognition by +1.05% and +4.18%, respectively. The authors present these results as evidence that prototype-based learning and coarse-to-fine fusion are effective for robust multimodal understanding. The source code is available at https://github.com/chr1s623/MVCL-DAF-PlusPlus .
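The two modules described above can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the summary does not give the actual loss or fusion architecture, so the cross-entropy-over-prototype-similarities loss, the mean-pooled summary, the temperature value, and every function name and toy label below are assumptions chosen only to make the ideas concrete.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # L2-normalize; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_prototypes(embeddings, labels):
    """Hypothetical helper: one prototype per class, the normalized mean
    of that class's normalized embeddings."""
    sums, counts = {}, {}
    for emb, lab in zip(embeddings, labels):
        z = normalize(emb)
        acc = sums.setdefault(lab, [0.0] * len(z))
        for i, x in enumerate(z):
            acc[i] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: normalize([x / counts[lab] for x in acc])
            for lab, acc in sums.items()}

def prototype_contrastive_loss(embeddings, labels, prototypes, temperature=0.1):
    """Assumed loss: cross-entropy over cosine similarities between each
    instance and all prototypes, so an instance is pulled toward its own
    class prototype and pushed away from the others."""
    classes = sorted(prototypes)
    total = 0.0
    for emb, lab in zip(embeddings, labels):
        z = normalize(emb)
        logits = [dot(z, prototypes[c]) / temperature for c in classes]
        probs = softmax(logits)
        total -= math.log(probs[classes.index(lab)])
    return total / len(embeddings)

def coarse_to_fine_fuse(query_tokens, context_tokens):
    """Assumed fusion. Coarse step: mean-pool one modality into a global
    summary. Fine step: use the summary as a query that attends over the
    other modality's token-level features."""
    dim = len(query_tokens[0])
    summary = [sum(t[i] for t in query_tokens) / len(query_tokens)
               for i in range(dim)]
    weights = softmax([dot(summary, t) / math.sqrt(dim) for t in context_tokens])
    attended = [sum(w * t[i] for w, t in zip(weights, context_tokens))
                for i in range(dim)]
    return summary + attended, weights

# Toy data: 2-D embeddings with made-up intent labels.
embeddings = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]]
labels = ["inform", "inform", "complain", "complain"]
protos = class_prototypes(embeddings, labels)
aligned_loss = prototype_contrastive_loss(embeddings, labels, protos)
mismatched_loss = prototype_contrastive_loss(embeddings, labels[::-1], protos)

text_tokens = [[1.0, 0.0], [0.8, 0.2]]
video_tokens = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.3]]
fused, weights = coarse_to_fine_fuse(text_tokens, video_tokens)
```

In this toy setup the loss is lower when instances are paired with their own class prototype than when the labels are swapped, which is the behavior the alignment objective is meant to encourage. The fused vector simply concatenates the coarse summary with the attended token-level features; a real model would typically use learned projections and attention in both directions.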

Takeaways, Limitations

Takeaways:
Demonstrates that prototype-based learning and coarse-to-fine attention fusion improve multimodal intent recognition performance.
Improves rare-class recognition in particular (+1.05% WF1 on MIntRec and +4.18% WF1 on MIntRec2.0).
Presents a new state-of-the-art model for multimodal intent recognition.
Supports reproducibility by releasing the source code.
Limitations:
Further experiments are needed to evaluate the generalization performance of the proposed model.
Performance has not been evaluated on multimodal datasets other than MIntRec and MIntRec2.0.
Analysis of the model's complexity and computational cost is required.