In this paper, we present a novel multi-modality, multi-task network and its associated learning algorithms, capable of processing data from about 12 different modalities, including images, videos, audio, text, depth, point clouds, time series, tabular data, graphs, X-ray, infrared, IMU, and hyperspectral data. The proposed method projects data from the different modalities into a unified embedding space by leveraging modality-specific tokenizers, a shared transformer architecture, and cross-attention mechanisms. To handle multi-modality, multi-task scenarios, we integrate modality-specific task heads for the different tasks within each modality. We further propose a novel pre-training strategy that iteratively switches between modalities to initialize the network, and a learning algorithm that trades off between fully joint training over all modalities and training on two modalities at a time. We provide comprehensive evaluations on 25 datasets from 12 modalities, demonstrating state-of-the-art performance and validating the effectiveness of the proposed architecture, pre-training strategy, and adaptive multi-task learning algorithm.
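To make the overall design concrete, the following is a minimal sketch of the tokenizer / shared-encoder / task-head structure described above. It is not the authors' implementation: the choice of PyTorch, the example modalities and tasks, the linear tokenizer stubs, the mean pooling, and all dimensions are illustrative assumptions, and the cross-attention, pre-training, and adaptive multi-task procedures are omitted for brevity.

```python
# Minimal sketch (assumptions, not the paper's code): modality-specific
# tokenizers project inputs into a shared embedding space, a shared
# transformer encoder processes the tokens, and per-(modality, task)
# heads produce the outputs.
import torch
import torch.nn as nn


class MultiModalMultiTaskNet(nn.Module):
    def __init__(self, embed_dim=256, depth=4, num_heads=8):
        super().__init__()
        # Hypothetical per-modality raw token widths (e.g. patch or frame features).
        modality_input_dims = {"image": 768, "audio": 128}
        # Hypothetical (modality, task) output sizes.
        task_output_dims = {
            ("image", "classification"): 1000,
            ("audio", "tagging"): 50,
        }
        # Modality-specific tokenizers: project raw tokens into the shared space.
        self.tokenizers = nn.ModuleDict({
            m: nn.Linear(d, embed_dim) for m, d in modality_input_dims.items()
        })
        # Shared transformer encoder used by every modality.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Modality-specific task heads, one per (modality, task) pair.
        self.heads = nn.ModuleDict({
            f"{m}_{t}": nn.Linear(embed_dim, out_dim)
            for (m, t), out_dim in task_output_dims.items()
        })

    def forward(self, x, modality, task):
        # x: (batch, num_tokens, modality_input_dim)
        tokens = self.tokenizers[modality](x)   # project into shared embedding space
        encoded = self.encoder(tokens)          # shared transformer encoder
        pooled = encoded.mean(dim=1)            # simple mean pooling over tokens
        return self.heads[f"{modality}_{task}"](pooled)


# Usage: one forward pass per (modality, task) pair.
model = MultiModalMultiTaskNet()
image_tokens = torch.randn(2, 196, 768)
logits = model(image_tokens, modality="image", task="classification")
print(logits.shape)  # torch.Size([2, 1000])
```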