This paper presents a vision for multimodal, multi-task (M3T) federated foundation models (FedFMs) that can provide transformative capabilities for extended reality (XR) systems. We propose a modular architecture for FedFMs that integrates the expressive power of M3T foundation models with the privacy-preserving model training principles of federated learning (FL), incorporating various orchestration paradigms for model training and aggregation. We codify the XR challenges that impact the implementation of FedFMs along the SHIFT dimensions: sensor and modality diversity, hardware heterogeneity and system-level constraints, interaction and embodied personalization, feature/task variability, and temporal and environmental variability. We illustrate how these dimensions manifest in emerging and anticipated XR applications and propose evaluation metrics, dataset requirements, and design tradeoffs necessary for the development of resource-aware FedFMs. We aim to provide a technical and conceptual foundation for context-aware, privacy-preserving intelligence in next-generation XR systems.
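To make the idea of modular training and aggregation concrete, the sketch below illustrates one possible orchestration step (not the paper's actual architecture or implementation): shared FedFM modules are averaged across heterogeneous XR clients weighted by local sample counts, while clients that lack a given sensor simply do not contribute to that module. The function and module names (`aggregate_shared_modules`, `rgb_encoder`, `imu_encoder`) are hypothetical and chosen only for illustration.

```python
# Minimal sketch of modular FedAvg-style aggregation for an M3T FedFM.
# Assumption: each client uploads only the modules it trained (e.g., the
# modality encoders matching its sensors); personalized modules stay on-device.
from typing import Dict, List, Set
import numpy as np

ClientUpdate = Dict[str, np.ndarray]  # module name -> flattened module weights


def aggregate_shared_modules(
    updates: List[ClientUpdate],
    sample_counts: List[int],
    shared: Set[str],
) -> Dict[str, np.ndarray]:
    """Weighted-average the modules listed in `shared`; skip personalized ones."""
    aggregated: Dict[str, np.ndarray] = {}
    for name in shared:
        # Only clients that actually trained this module contribute to its
        # average (modality/task coverage differs across XR devices).
        contributors = [
            (u[name], n) for u, n in zip(updates, sample_counts) if name in u
        ]
        if not contributors:
            continue
        weight_sum = float(sum(n for _, n in contributors))
        aggregated[name] = sum(w * (n / weight_sum) for w, n in contributors)
    return aggregated


# Toy usage: two XR clients with different modality coverage.
client_a = {"rgb_encoder": np.ones(4), "imu_encoder": np.zeros(4)}
client_b = {"rgb_encoder": 3 * np.ones(4)}  # this device has no IMU sensor
global_modules = aggregate_shared_modules(
    [client_a, client_b],
    sample_counts=[100, 300],
    shared={"rgb_encoder", "imu_encoder"},
)
print(global_modules["rgb_encoder"])  # -> [2.5 2.5 2.5 2.5]
```

Restricting each module's average to the clients that trained it is one simple way to accommodate sensor and modality diversity; other orchestration paradigms (e.g., clustered or hierarchical aggregation) would replace this averaging rule.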