EarthSynth is a diffusion-based generative foundation model that addresses the scarcity of labeled data in remote sensing image interpretation. It synthesizes diverse, labeled Earth observation data for downstream interpretation tasks and is, to our knowledge, the first attempt at multi-task generation in the remote sensing field, overcoming the limited generalization of task-oriented synthesis. Trained on the EarthSynth-180K dataset, EarthSynth combines a counterfactual compositional training strategy with a 3D batch sample selection mechanism to increase training-data diversity and strengthen categorical control. It further introduces R-Filter, a rule-based method for selecting informative synthetic samples. We evaluate EarthSynth on scene classification, object detection, and semantic segmentation in open-world scenarios, where it yields significant gains on open-vocabulary understanding tasks, offering a practical route to advancing remote sensing image interpretation.
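To make the filtering step concrete, the sketch below shows what a rule-based filter over synthetic labeled samples could look like. This is a minimal toy stand-in, not the paper's actual R-Filter: the sample fields (`label_coverage`, `num_classes`, `clip_score`), the thresholds, and the rules themselves are all hypothetical assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class SyntheticSample:
    # Hypothetical record summarizing a generated image and its label mask.
    label_coverage: float   # fraction of pixels carrying a class label
    num_classes: int        # distinct classes present in the mask
    clip_score: float       # image-text alignment score in [0, 1]

def rule_filter(samples, min_coverage=0.05, min_classes=1, min_clip=0.25):
    """Keep only samples that satisfy every rule (toy stand-in for R-Filter)."""
    return [
        s for s in samples
        if s.label_coverage >= min_coverage
        and s.num_classes >= min_classes
        and s.clip_score >= min_clip
    ]

pool = [
    SyntheticSample(0.30, 3, 0.40),  # informative: passes all rules
    SyntheticSample(0.01, 1, 0.50),  # nearly empty label mask: rejected
    SyntheticSample(0.20, 2, 0.10),  # poor image-text alignment: rejected
]
print(len(rule_filter(pool)))  # → 1
```

The design point is simply that cheap, interpretable per-sample rules can prune uninformative generations before they are used to train downstream models; the real R-Filter's criteria are defined in the paper itself.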