IndexTTS2 is a model proposed to address a limitation of existing large-scale autoregressive text-to-speech (TTS) models, which offer excellent naturalness but struggle with precise duration control. It supports two generation modes: one in which the number of generated tokens is specified explicitly, enabling precise control of speech duration, and a free mode in which the token count is left unspecified. It also disentangles emotional expression from speaker identity, allowing timbre and emotion to be controlled independently. GPT latent representations are used to improve the intelligibility of highly emotional speech, and a soft instruction mechanism built on a fine-tuned Qwen3 makes emotion control more convenient. Experiments on multiple datasets show that IndexTTS2 outperforms state-of-the-art zero-shot TTS models in word error rate, speaker similarity, and emotional fidelity.
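
To make the link between token count and duration concrete, the following is a minimal sketch of the arithmetic only, not the IndexTTS2 implementation. It assumes the speech tokens are produced at a fixed rate; the rate `ASSUMED_TOKENS_PER_SECOND` and the helper names are hypothetical and do not come from the source.

```python
from typing import Optional

# Hypothetical token rate, chosen for illustration only; the actual rate
# used by IndexTTS2 is not stated in this summary.
ASSUMED_TOKENS_PER_SECOND = 25.0


def tokens_for_duration(target_seconds: Optional[float],
                        tokens_per_second: float) -> Optional[int]:
    """Return the token count to request for a target duration.

    Returns None when no duration is given, which stands in for the
    free generation mode (token count left unspecified).
    """
    if target_seconds is None:
        return None
    if target_seconds <= 0 or tokens_per_second <= 0:
        raise ValueError("duration and token rate must be positive")
    # Request at least one token, rounded to the nearest whole token.
    return max(1, round(target_seconds * tokens_per_second))


def duration_of_tokens(num_tokens: int, tokens_per_second: float) -> float:
    """Return the speech duration in seconds implied by a token count."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    n = tokens_for_duration(3.2, ASSUMED_TOKENS_PER_SECOND)
    print(n, duration_of_tokens(n, ASSUMED_TOKENS_PER_SECOND))
    print(tokens_for_duration(None, ASSUMED_TOKENS_PER_SECOND))
```

Under this assumption, the achievable duration resolution is one token, that is, `1 / tokens_per_second` seconds.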