This paper presents the first comprehensive, clinically grounded study of multimodal machine learning (MML), which is rapidly transforming the detection, feature analysis, and long-term monitoring of mental health disorders. In contrast to early studies that relied on discrete data streams such as speech, text, or wearable signals, recent research has focused on architectures that integrate heterogeneous modalities to capture the rich and complex presentation of mental disorders. This paper (i) catalogs 26 publicly available datasets spanning audio, visual, physiological, and text modalities, and (ii) systematically compares transformer-based, graph-based, and hybrid fusion strategies across 28 models to highlight trends in representation learning and cross-modal alignment. Beyond summarizing current capabilities, we examine unmet challenges, including data governance and privacy, demographic and intersectional fairness, explainability of assessments, and the clinical complexity of mental health disorders in multimodal settings. This paper aims to bridge methodological innovation and psychiatric utility, pointing the way toward next-generation multimodal decision-support systems that are trustworthy to both ML researchers and mental health professionals.
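To make the notion of transformer-based fusion and cross-modal alignment concrete, the following is a minimal illustrative sketch (not a model from the surveyed literature): text-token embeddings attend over audio and physiological embeddings via cross-modal attention before pooling into a per-subject prediction. All module names, dimensions, and the classifier head are assumptions chosen for illustration.

```python
# Illustrative sketch only: cross-modal attention fusion of text, audio, and
# physiological embedding sequences (PyTorch). Hypothetical names and sizes.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuse text, audio, and physiological token sequences with cross-attention."""

    def __init__(self, dim: int = 128, heads: int = 4, num_classes: int = 2):
        super().__init__()
        # Text queries attend over audio and physiological keys/values.
        self.text_audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_physio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text, audio, physio):
        # Each input: (batch, modality_seq_len, dim) pre-extracted embeddings.
        audio_ctx, _ = self.text_audio_attn(text, audio, audio)
        physio_ctx, _ = self.text_physio_attn(text, physio, physio)
        fused = self.norm(text + audio_ctx + physio_ctx)  # residual cross-modal fusion
        pooled = fused.mean(dim=1)                        # temporal mean pooling
        return self.classifier(pooled)                    # per-subject logits


if __name__ == "__main__":
    model = CrossModalFusion()
    text = torch.randn(2, 32, 128)    # e.g. transcript sentence embeddings
    audio = torch.randn(2, 64, 128)   # e.g. frame-level speech embeddings
    physio = torch.randn(2, 16, 128)  # e.g. windowed wearable-signal embeddings
    print(model(text, audio, physio).shape)  # torch.Size([2, 2])
```

Graph- and hybrid-based strategies surveyed in the paper differ mainly in how this alignment step is realized, e.g., by message passing over a modality graph rather than attention.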