This paper presents a novel multimodal food recognition framework that combines visual and textual modalities to improve the accuracy and robustness of food recognition. The proposed approach uses a dynamic multimodal fusion strategy that adaptively integrates features from the visual input with complementary textual metadata. The fusion mechanism is designed to make full use of the information carried by each modality while mitigating the negative impact of missing or inconsistent modality data. Rigorous evaluation on the UPMC Food-101 dataset yields unimodal classification accuracies of 73.60% for images and 88.84% for text. When both modalities are fused, the model achieves 97.84% accuracy, outperforming several state-of-the-art methods. Extensive experimental analysis confirms the robustness, adaptability, and computational efficiency of the proposed framework, highlighting its practical applicability to real-world multimodal food recognition scenarios.
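The abstract does not spell out the form of the dynamic fusion mechanism. As an illustration only, the sketch below shows one common style of adaptive (gated) fusion of image and text features in PyTorch; the module names, feature dimensions, and gating formulation are assumptions for exposition and are not the authors' implementation.

```python
# Illustrative sketch only -- NOT the authors' implementation.
# Assumes pre-extracted image and text feature vectors (e.g., from a CNN
# and a text encoder); dimensions and the gating form are hypothetical.
import torch
import torch.nn as nn


class GatedMultimodalFusion(nn.Module):
    """Adaptively weights image and text features with a learned gate,
    so an uninformative or missing modality can be down-weighted."""

    def __init__(self, img_dim=2048, txt_dim=768, fused_dim=512, num_classes=101):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, fused_dim)
        self.txt_proj = nn.Linear(txt_dim, fused_dim)
        # The gate sees both modalities and outputs a per-sample weight in (0, 1).
        self.gate = nn.Sequential(
            nn.Linear(2 * fused_dim, fused_dim),
            nn.ReLU(),
            nn.Linear(fused_dim, 1),
            nn.Sigmoid(),
        )
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, img_feat, txt_feat):
        h_img = torch.tanh(self.img_proj(img_feat))
        h_txt = torch.tanh(self.txt_proj(txt_feat))
        z = self.gate(torch.cat([h_img, h_txt], dim=-1))  # dynamic mixing weight
        fused = z * h_img + (1 - z) * h_txt               # convex combination
        return self.classifier(fused)


if __name__ == "__main__":
    model = GatedMultimodalFusion()
    logits = model(torch.randn(4, 2048), torch.randn(4, 768))  # toy batch
    print(logits.shape)  # torch.Size([4, 101]) -- Food-101 has 101 classes
```

Because the gate is computed per sample, a corrupted or empty modality can be assigned a weight near zero, which is one way to realize the robustness to missing or inconsistent modality data described above.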