Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

MMFformer: Multimodal Fusion Transformer Network for Depression Detection

Created by
  • Haebom

Authors

Md Rezwanul Haque, Md. Milon Islam, SM Taslim Uddin Raju, Hamdi Altaheri, Lobna Nassar, Fakhri Karray

Outline

This paper presents MMFformer, a novel multimodal network for early detection of depression from social media content. MMFformer uses a Transformer network to capture spatial features from video and a Transformer encoder to model the temporal dynamics of audio. It then combines the features of the different modalities through late and mid-stage (intermediate) fusion strategies to capture cross-modal correlations and extract depression-related spatiotemporal patterns. On two large-scale depression detection datasets, D-Vlog and LMVD, it outperforms existing state-of-the-art methods, improving F1-score by 13.92% on D-Vlog and 7.74% on LMVD. The source code is publicly available.
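The difference between the two fusion strategies mentioned above can be illustrated with a minimal sketch. This is not the authors' architecture: mean pooling stands in for the Transformer encoders, the weights are random rather than trained, and every name here (`mean_pool`, `intermediate_fusion`, `late_fusion`, the feature sizes) is a hypothetical choice made for illustration only.

```python
import math
import random


def mean_pool(frames):
    """Average a sequence of per-frame feature vectors into one vector.

    Stand-in for a learned encoder; assumed here only for illustration.
    """
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def linear(x, weights, bias):
    """Single linear unit: dot(x, weights) + bias."""
    return sum(xi * wi for xi, wi in zip(x, weights)) + bias


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def intermediate_fusion(video_feat, audio_feat, weights, bias):
    """Concatenate modality features first, then classify the joint vector."""
    joint = video_feat + audio_feat  # list concatenation
    return sigmoid(linear(joint, weights, bias))


def late_fusion(video_feat, audio_feat, video_head, audio_head):
    """Classify each modality separately, then average the two probabilities."""
    p_video = sigmoid(linear(video_feat, *video_head))
    p_audio = sigmoid(linear(audio_feat, *audio_head))
    return 0.5 * (p_video + p_audio)


rng = random.Random(0)

# Hypothetical inputs: 10 time steps of 4-dim video and 3-dim audio features.
video_frames = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(10)]
audio_frames = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(10)]
video_feat = mean_pool(video_frames)
audio_feat = mean_pool(audio_frames)

# Random (untrained) classifier parameters, as (weights, bias) pairs.
joint_weights = [rng.gauss(0, 1) for _ in range(7)]
video_head = ([rng.gauss(0, 1) for _ in range(4)], 0.0)
audio_head = ([rng.gauss(0, 1) for _ in range(3)], 0.0)

p_mid = intermediate_fusion(video_feat, audio_feat, joint_weights, 0.0)
p_late = late_fusion(video_feat, audio_feat, video_head, audio_head)
print(f"intermediate fusion: {p_mid:.3f}, late fusion: {p_late:.3f}")
```

In this sketch, intermediate fusion lets one classifier see both modalities at once and so can pick up cross-modal interactions, while late fusion only combines per-modality decisions.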

Takeaways, Limitations

Takeaways:
Using social media data shows potential for improving the accuracy of early depression detection.
Demonstrates the effectiveness of multimodal information fusion for analyzing depression-related patterns.
Outperforms existing state-of-the-art methods (F1-score improvement of 13.92% on D-Vlog and 7.74% on LMVD).
Supports reproducibility and further development through publicly available source code.
Limitations:
Generalization performance needs further verification, given the characteristics of the datasets used.
Diverse cultural backgrounds and linguistic differences may not be sufficiently considered.
Privacy and ethical issues may not be discussed in depth.
Results may be biased toward specific modalities (e.g., performance may vary with the quality of the video or audio data).
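
For reference, the F1-score behind the gains cited in the takeaways is the harmonic mean of precision and recall. The sketch below shows the computation; the confusion counts are hypothetical and not taken from the paper. Because this summary does not state whether the reported percentages are absolute or relative gains, both readings are computed.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a baseline and an improved model (not from the paper).
baseline_f1 = f1_score(tp=60, fp=40, fn=40)  # precision = recall = 0.6
improved_f1 = f1_score(tp=75, fp=25, fn=25)  # precision = recall = 0.75

# Two ways to express the same improvement.
absolute_gain = (improved_f1 - baseline_f1) * 100  # percentage points
relative_gain = (improved_f1 - baseline_f1) / baseline_f1 * 100  # percent of baseline

print(f"baseline F1 = {baseline_f1:.3f}, improved F1 = {improved_f1:.3f}")
print(f"absolute gain = {absolute_gain:.2f} points, relative gain = {relative_gain:.2f}%")
```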