This paper addresses interactive text-to-image retrieval (I-TIR), which enables cutting-edge services in areas such as e-commerce and education. Existing I-TIR methods rely on fine-tuned multimodal large language models (MLLMs), which incur high training and update costs and generalize poorly. In particular, fine-tuning narrows the pre-training distribution of MLLMs and degrades generalization, while the multi-turn interaction inherent to I-TIR increases the diversity and complexity of queries, so the system frequently encounters queries and images that are not well represented in the fine-tuning data. To address these issues, the paper proposes a Diffusion Augmented Retrieval (DAR) framework that leverages a diffusion model (DM) for text-to-image mapping and maintains robust performance without any MLLM fine-tuning. DAR generates multiple intermediate representations through LLM-based dialogue augmentation and DM-based text-to-image generation, richly describing the user's information need and identifying semantically and visually relevant images more accurately.
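To make the pipeline concrete, below is a minimal sketch of the idea as summarized above: augment the dialogue into reformulated queries, render hypothetical images with a diffusion model, then score gallery candidates by fusing text-to-image and image-to-image similarity in a shared embedding space. The model choices (CLIP, Stable Diffusion), the `augment_dialogue` helper, and the fusion weight `alpha` are illustrative assumptions, not the paper's exact components.

```python
# Hedged sketch of a DAR-style retrieval loop; not the authors' implementation.
import torch
from transformers import CLIPModel, CLIPProcessor
from diffusers import StableDiffusionPipeline

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
sd = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

def augment_dialogue(dialogue: list[str]) -> list[str]:
    """Placeholder for LLM-based dialogue augmentation: in DAR an LLM would
    produce several reformulations of the user's evolving information need.
    Here we trivially collapse the turns into a single query string."""
    return [" ".join(dialogue)]

@torch.no_grad()
def retrieve(dialogue, candidate_images, alpha=0.5, top_k=5):
    queries = augment_dialogue(dialogue)
    # Text side: embed each augmented query with CLIP's text encoder.
    t_in = proc(text=queries, return_tensors="pt", padding=True)
    t_emb = clip.get_text_features(**t_in)
    t_emb = t_emb / t_emb.norm(dim=-1, keepdim=True)
    # Visual side: render one hypothetical image per query with the DM,
    # giving an intermediate visual representation of the information need.
    gen = [sd(q, num_inference_steps=20).images[0] for q in queries]
    g_in = proc(images=gen, return_tensors="pt")
    g_emb = clip.get_image_features(**g_in)
    g_emb = g_emb / g_emb.norm(dim=-1, keepdim=True)
    # Embed the candidate gallery once.
    c_in = proc(images=candidate_images, return_tensors="pt")
    c_emb = clip.get_image_features(**c_in)
    c_emb = c_emb / c_emb.norm(dim=-1, keepdim=True)
    # Fuse semantic (text->image) and visual (generated image->image)
    # relevance; the averaging over queries and alpha=0.5 are assumptions.
    score = alpha * (t_emb @ c_emb.T).mean(0) + (1 - alpha) * (g_emb @ c_emb.T).mean(0)
    return score.topk(min(top_k, len(candidate_images))).indices.tolist()
```

The key design point the sketch illustrates is that no MLLM is fine-tuned: both encoders are used off-the-shelf, and the diffusion model supplies the bridge between dialogue text and the image modality.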