In this paper, we present a speaker-conditional text-to-speech (TTS) system that supports multiple Indian languages and addresses speech generation for speakers unseen during training. The system builds on a diffusion-based TTS architecture: a speaker encoder extracts embeddings from short reference audio samples, and the denoising diffusion probabilistic model (DDPM) decoder is conditioned on these embeddings for multi-speaker generation. To improve prosody and naturalness, a cross-attention-based duration prediction mechanism that leverages the reference audio yields more accurate and speaker-consistent timing, improving duration modeling and overall expressiveness while producing speech that closely matches the target speaker. In addition, classifier-free guidance is used to enable more natural zero-shot generation for unseen speakers. Language-specific speaker-conditional models are trained for several Indian languages, including Bengali, Gujarati, Hindi, Marathi, Malayalam, Punjabi, and Tamil, using the IndicSUPERB dataset.
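As a rough illustration of the classifier-free guidance step mentioned above, the sketch below blends the speaker-conditional and unconditional noise predictions of a diffusion decoder at sampling time. It is a minimal PyTorch-style sketch under assumed interfaces: the decoder signature, the zero "null" speaker embedding, and the guidance scale are illustrative assumptions, not the exact configuration used in this work.

```python
import torch

def cfg_noise_estimate(decoder, x_t, t, text_enc, spk_emb, guidance_scale=3.0):
    """Classifier-free guidance for a speaker-conditional diffusion decoder.

    `decoder`, its call signature, and `guidance_scale` are hypothetical
    placeholders used only to illustrate the guidance rule.
    """
    # Conditional prediction: the decoder sees the reference-speaker embedding.
    eps_cond = decoder(x_t, t, text_enc, spk_emb)
    # Unconditional prediction: the speaker embedding is replaced by a null (zero)
    # embedding, mirroring the conditioning dropout assumed during training.
    eps_uncond = decoder(x_t, t, text_enc, torch.zeros_like(spk_emb))
    # The guided estimate pushes sampling toward the speaker-conditional distribution,
    # which is what makes zero-shot generation for unseen speakers more robust.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```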