This paper proposes REMOTE, a novel unified framework for multimodal relation extraction (MRE). REMOTE jointly extracts intra-modal and inter-modal relations between textual entities and visual objects by leveraging multilevel optimal transport and a mixture of experts. It overcomes the single-relation extraction and redundant computation inherent in existing methods, dynamically selecting the optimal interaction features for different relation triplets through a mixture-of-experts mechanism. Furthermore, it introduces a multilevel optimal transport fusion module that preserves the benefits of multilayer encoding without discarding low-level information, yielding more expressive representations. We evaluate the effectiveness of REMOTE on a new dataset, UMRE, and achieve state-of-the-art performance on existing MRE benchmarks. The source code is available on GitHub.