This paper focuses on speech-to-text alignment, a crucial component of neural network-based text-to-speech (TTS) models. Autoregressive TTS models typically learn alignment online through an attention mechanism, whereas non-autoregressive end-to-end TTS models rely on durations extracted from external sources. In this paper, we propose a novel duration prediction framework that predicts reliable phoneme-level duration distributions from a given text. Experimental results demonstrate that the proposed duration model is more accurate and more adaptable to varying conditions than existing baseline models. In particular, it significantly improves phoneme-level alignment accuracy and makes zero-shot TTS models more robust to mismatches between the prompt and the input audio.
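
To illustrate what predicting a phoneme-level duration distribution (rather than a single duration value) could look like, a minimal sketch follows. The abstract does not specify the architecture, so the convolutional stack, the Gaussian parameterization over log-durations, and all module and parameter names here are assumptions for illustration only, not the paper's actual model.

```python
import torch
import torch.nn as nn


class DurationDistributionPredictor(nn.Module):
    """Hypothetical predictor of a Gaussian over log-duration for each phoneme."""

    def __init__(self, d_model: int = 256, hidden: int = 256, kernel: int = 3):
        super().__init__()
        # Simple convolutional encoder over the phoneme sequence (an assumption).
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
        )
        # Two outputs per phoneme: mean and log-variance of log-duration.
        self.proj = nn.Linear(hidden, 2)

    def forward(self, phoneme_emb: torch.Tensor) -> torch.distributions.Normal:
        # phoneme_emb: (batch, num_phonemes, d_model)
        h = self.conv(phoneme_emb.transpose(1, 2)).transpose(1, 2)
        mean, log_var = self.proj(h).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, torch.exp(0.5 * log_var))


# Usage: sample per-phoneme durations (in frames) for one utterance of 12 phonemes.
emb = torch.randn(1, 12, 256)
dist = DurationDistributionPredictor()(emb)
durations = dist.sample().squeeze(-1).exp()  # map log-duration back to frames
```

Modeling a distribution instead of a point estimate is one way such a framework could expose uncertainty over phoneme durations; the paper's actual formulation may differ.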