JointDiT is a diffusion transformer that models the joint distribution of RGB images and depth maps. By leveraging the architectural strengths and strong image priors of state-of-the-art diffusion transformers, it generates high-quality images together with geometrically plausible, accurate depth maps. Two techniques make training effective across all noise levels: adaptive scheduling weights, which vary with the noise level of each modality, and an imbalanced timestep sampling strategy. Because the model is trained under every combination of noise levels, it can naturally handle several generation tasks—joint generation, depth estimation, and depth-conditional image generation—simply by controlling the timestep of each branch. JointDiT demonstrates strong joint generation performance and achieves comparable results on depth estimation and depth-conditional image generation, suggesting that joint distribution modeling can be a viable alternative to conditional generation.
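To make the per-branch timestep control concrete, here is a minimal sketch (not the authors' code; the function name and task labels are illustrative assumptions) of how assigning a timestep to each branch selects among the three tasks: a branch at timestep 0 holds a clean signal that acts as conditioning, while a branch at a positive timestep is being denoised.

```python
def branch_timesteps(task, t):
    """Return (t_image, t_depth) for one denoising step at noise level t.

    A branch at timestep 0.0 carries a clean (conditioning) signal;
    a branch at t > 0 is noisy and being denoised.
    Task names here are illustrative, not from the paper.
    """
    if task == "joint_generation":   # denoise image and depth together
        return t, t
    if task == "depth_estimation":   # clean image conditions the depth branch
        return 0.0, t
    if task == "depth_to_image":     # clean depth conditions the image branch
        return t, 0.0
    raise ValueError(f"unknown task: {task}")
```

For example, `branch_timesteps("depth_estimation", 0.7)` returns `(0.0, 0.7)`: the image branch is held at its clean state while the depth branch is denoised, which is how a single jointly trained model performs conditional inference without retraining.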