In this paper, we propose FedMM-X (Federated Multi-Modal Explainable Intelligence), a novel framework that integrates multimodal data, including vision, language, and speech, for AI systems operating in real-world environments. FedMM-X combines federated learning with explainable multimodal inference to ensure trustworthy intelligence in distributed, dynamic settings. It leverages cross-modal consistency checks, client-level interpretability mechanisms, and dynamic trust calibration to address the challenges of data heterogeneity, modal imbalance, and out-of-distribution generalization. Through rigorous evaluations on federated multimodal benchmarks spanning vision-language tasks, we demonstrate that FedMM-X improves both accuracy and interpretability while reducing vulnerability to adversarial perturbations and spurious correlations. We also present a novel trust score aggregation method to quantify global model trustworthiness under dynamic client participation. These results pave the way for robust, interpretable, and socially responsible AI systems in real-world deployments.
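To make the trust score aggregation idea concrete, the following is a minimal sketch only; the paper's exact formulation is not given in this abstract, so we assume a simple trust-weighted variant of FedAvg in which each participating client's update is weighted by a calibrated trust score. The function name `aggregate_trust_weighted` and the per-client `trust_scores` values are hypothetical placeholders, not the paper's actual API.

```python
import numpy as np

def aggregate_trust_weighted(client_updates, trust_scores, eps=1e-8):
    """Hypothetical trust-weighted FedAvg-style aggregation (a sketch,
    not FedMM-X's actual method).

    client_updates: list of dicts mapping parameter name -> np.ndarray
    trust_scores:   per-client trust in [0, 1], e.g. produced by some
                    dynamic trust calibration step (assumed here)
    """
    weights = np.asarray(trust_scores, dtype=np.float64)
    weights = weights / (weights.sum() + eps)  # normalize over participants

    aggregated = {}
    for name in client_updates[0]:
        # Stack each parameter across clients, then take the trust-weighted mean
        stacked = np.stack([update[name] for update in client_updates])
        aggregated[name] = np.tensordot(weights, stacked, axes=1)
    return aggregated

# Example: three clients with unequal trust under dynamic participation
updates = [{"w": np.random.randn(4, 2)} for _ in range(3)]
global_update = aggregate_trust_weighted(updates, trust_scores=[0.9, 0.6, 0.3])
```

Under these assumptions, down-weighting low-trust clients limits the influence of unreliable or adversarial updates on the global model, which is one plausible way to realize the robustness behavior the abstract describes.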