In this paper, we present a federated continual instruction tuning (FCIT) benchmark to address the prohibitive data-collection and computational costs of instruction fine-tuning for large multimodal models (LMMs). Unlike existing federated learning methods, which assume a fixed number of tasks, FCIT models real-world settings in which clients continually acquire new knowledge while needing to retain performance on previously learned tasks. To this end, we construct a benchmark comprising two realistic scenarios, four settings, and 12 instruction fine-tuning datasets, and propose a method that addresses varying degrees of data heterogeneity and catastrophic forgetting through dynamic knowledge construction and subspace selective activation. Experimental results show that the proposed method significantly improves model performance. The code and datasets are publicly available.
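To make the two named components more concrete, below is a minimal PyTorch sketch of one way "dynamic knowledge construction" and "subspace selective activation" could be realized: a layer that grows a new low-rank subspace per task and, at inference, activates only the subspace whose key best matches the input feature. The class name SubspaceSelectiveLinear, the cosine key-matching rule, and the LoRA-style adapters are our own illustrative assumptions, not the paper's confirmed design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubspaceSelectiveLinear(nn.Module):
    """Linear layer with per-task low-rank subspaces, one of which is
    activated per input. Illustrative sketch, not the paper's exact method."""

    def __init__(self, dim_in, dim_out, rank=8):
        super().__init__()
        self.dim_in, self.dim_out, self.rank = dim_in, dim_out, rank
        self.base = nn.Linear(dim_in, dim_out)  # stands in for a frozen pretrained weight
        self.base.requires_grad_(False)
        self.keys = nn.ParameterList()   # one matching key per task subspace
        self.downs = nn.ModuleList()     # low-rank down-projections
        self.ups = nn.ModuleList()       # low-rank up-projections

    def add_task_subspace(self):
        # "Dynamic knowledge construction" (as we read it): allocate a fresh
        # subspace when a new task arrives, leaving earlier subspaces untouched,
        # so new knowledge does not overwrite old parameters.
        self.keys.append(nn.Parameter(torch.randn(self.dim_in)))
        self.downs.append(nn.Linear(self.dim_in, self.rank, bias=False))
        self.ups.append(nn.Linear(self.rank, self.dim_out, bias=False))

    def forward(self, x):
        out = self.base(x)
        if len(self.keys) == 0:
            return out
        # "Subspace selective activation" (as we read it): route the input
        # through the single subspace whose key is most similar to the
        # mean-pooled input feature.
        query = x.reshape(-1, self.dim_in).mean(dim=0)        # (dim_in,)
        keys = torch.stack(list(self.keys))                   # (n_tasks, dim_in)
        idx = int(F.cosine_similarity(keys, query.unsqueeze(0), dim=-1).argmax())
        return out + self.ups[idx](self.downs[idx](x))


# Usage: grow one subspace per incoming task, then run a forward pass.
layer = SubspaceSelectiveLinear(dim_in=16, dim_out=16)
layer.add_task_subspace()
layer.add_task_subspace()
y = layer(torch.randn(4, 16))   # (batch, dim_in) -> (batch, dim_out)
```

Activating only one matched subspace per input keeps task-specific updates isolated, which is one plausible way such a design could mitigate both cross-client data heterogeneity and forgetting of earlier tasks.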