This paper presents "Transparent Earth," a Transformer-based architecture for reconstructing subsurface properties from heterogeneous datasets with varying sparsity, resolution, and modality (e.g., stress angle, mantle temperature, and plate type). Each modality represents a different type of observation, and the model integrates modality encodings, derived from textual descriptions of each modality via a text embedding model, together with positional encodings of the observations. This design allows the model to be extended to any number of modalities, simplifying the addition of new modalities not initially considered. The current model includes eight modalities spanning orientations, categorical classes, and continuous properties such as temperature and thickness. The design also supports in-context learning, allowing the model to generate predictions either with no inputs or with an arbitrary number of additional observations drawn from a random subset of modalities; on validation data, providing such in-context observations reduces stress angle prediction errors by more than a factor of three. The proposed architecture is scalable, with performance improving as the number of parameters increases. Together, these developments position Transparent Earth as an early baseline model of Earth's subsurface, with the ultimate aim of predicting subsurface properties anywhere on Earth.
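To make the token construction concrete, the sketch below illustrates one plausible reading of the abstract, assuming PyTorch: each observation token sums a value embedding, a modality encoding projected from a text embedding of the modality's description, and a positional encoding of the observation's location, and a query token attends over a variable number of in-context observations. All names (`TransparentEarthSketch`, `fourier_positional_encoding`), dimensions, and the placeholder random text embeddings are hypothetical and not taken from the paper.

```python
# Minimal sketch, not the paper's implementation. Assumes PyTorch; the text
# embeddings of modality descriptions are stand-in random vectors here.
import math
import torch
import torch.nn as nn


def fourier_positional_encoding(latlon: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Encode (lat, lon) in degrees as sine/cosine features at several frequencies (assumed scheme)."""
    radians = latlon * math.pi / 180.0                         # (N, 2)
    freqs = 2.0 ** torch.arange(num_freqs).float()             # (F,)
    angles = radians[..., None] * freqs                        # (N, 2, F)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)  # (N, 4F)


class TransparentEarthSketch(nn.Module):
    """Toy encoder: token = value embedding + modality encoding (from a text
    embedding of the modality description) + positional encoding of the observation."""

    def __init__(self, text_dim: int = 384, d_model: int = 128, num_freqs: int = 8):
        super().__init__()
        self.value_proj = nn.Linear(1, d_model)                # continuous scalar observations
        self.modality_proj = nn.Linear(text_dim, d_model)      # text embedding -> model space
        self.pos_proj = nn.Linear(4 * num_freqs, d_model)
        self.num_freqs = num_freqs
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.query_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.head = nn.Linear(d_model, 1)                      # e.g., predict stress angle

    def forward(self, values, latlon, modality_text_emb, query_latlon):
        # values: (N, 1); latlon: (N, 2); modality_text_emb: (N, text_dim); query_latlon: (1, 2)
        tokens = (self.value_proj(values)
                  + self.modality_proj(modality_text_emb)
                  + self.pos_proj(fourier_positional_encoding(latlon, self.num_freqs)))
        query = self.query_token + self.pos_proj(
            fourier_positional_encoding(query_latlon, self.num_freqs))
        # Works with zero or many in-context observations: the query token is always present.
        seq = torch.cat([query, tokens[None]], dim=1)          # (1, N + 1, d_model)
        return self.head(self.encoder(seq)[:, 0])              # read the prediction off the query token


# Usage: three in-context observations from one modality, one query location.
model = TransparentEarthSketch()
text_emb = torch.randn(3, 384)  # placeholder for a "stress angle" description embedding
pred = model(
    torch.randn(3, 1),
    torch.tensor([[34.0, -118.0], [40.0, -120.0], [36.0, -115.0]]),
    text_emb,
    torch.tensor([[35.0, -117.0]]),
)
```

Because the modality identity enters only through the projected text embedding, adding a new modality in this sketch requires no architectural change, only a new description embedding, which mirrors the extensibility claim in the abstract.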