This paper addresses multimodal speech generation, which synthesizes high-quality speech from multiple input modalities: text, video, and reference audio. We propose a multimodal alignment diffusion transformer, AlignDiT, to address the challenges of speech intelligibility, audio-video synchronization, speech naturalness, and speaker similarity to the reference audio. AlignDiT builds on the in-context learning capability of the DiT architecture and explores three strategies for aligning multimodal representations. Furthermore, we introduce a novel multimodal classifier-free guidance mechanism that adaptively balances information from each modality during speech synthesis.
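To make the multimodal classifier-free guidance idea concrete, the sketch below shows one common way to extend standard classifier-free guidance to two conditioning modalities (text and video) with separate guidance weights. The `model` signature, the order in which conditions are dropped, and the weights `w_text` and `w_video` are illustrative assumptions, not the paper's exact adaptive formulation.

```python
import torch

def multimodal_cfg(model, x_t, t, text, video, ref_audio,
                   w_text=2.0, w_video=1.0):
    """Sketch of multi-condition classifier-free guidance.

    Assumes `model` predicts the diffusion target (e.g., noise or velocity)
    from the noisy input `x_t`, timestep `t`, and optional conditions, where
    passing None for a condition means using its null/unconditional embedding.
    """
    # Prediction with both text and video dropped (unconditional w.r.t. them).
    eps_uncond = model(x_t, t, text=None, video=None, ref_audio=ref_audio)
    # Prediction conditioned on text only.
    eps_text = model(x_t, t, text=text, video=None, ref_audio=ref_audio)
    # Prediction conditioned on both text and video.
    eps_full = model(x_t, t, text=text, video=video, ref_audio=ref_audio)

    # Each weight controls how strongly its modality pushes the estimate
    # away from the less-conditioned prediction.
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_video * (eps_full - eps_text))
```

In this form, raising `w_text` emphasizes intelligibility of the spoken content, while raising `w_video` emphasizes audio-video synchronization; an adaptive scheme, as described in the paper, would set these trade-offs automatically rather than using fixed scalars.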