This paper presents LOTS (LOcalized Text and Sketch for fashion image generation), a method that conditions fashion image generation on both sketches and text, reflecting the complex creative process of fashion design. LOTS combines a global description with localized sketch-text pairs and generates complete fashion images through a diffusion-based step-wise merging strategy. A modular pair-centric representation encodes each sketch and its accompanying text into a shared latent space while keeping local features independent across pairs; attention-based guidance then integrates the local and global conditions during the diffusion model's multi-stage denoising process. We also introduce Sketchy, a new fashion dataset, and show through quantitative and qualitative evaluations that LOTS outperforms existing methods.
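To make the high-level pipeline concrete, the following is a minimal, hypothetical PyTorch sketch of the pair-centric conditioning described above: each localized (sketch, text) pair is projected into a shared latent space while pairs remain independent of one another, and an attention step merges the resulting local tokens with a global description token into a condition for the denoising model. All dimensions, module names, and encoders here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of pair-centric conditioning with attention-based merging.
# PairEncoder and AttentionMerger are hypothetical names; dimensions are toy.
import torch
import torch.nn as nn

D = 64  # shared latent dimension (assumed)

class PairEncoder(nn.Module):
    """Encodes one (local sketch, local text) pair into the shared latent
    space; each pair is processed independently of the other pairs."""
    def __init__(self, sketch_dim=32, text_dim=48, d=D):
        super().__init__()
        self.sketch_proj = nn.Linear(sketch_dim, d)
        self.text_proj = nn.Linear(text_dim, d)

    def forward(self, sketch_feat, text_feat):
        # Each pair yields two tokens in the shared latent space.
        return torch.stack(
            [self.sketch_proj(sketch_feat), self.text_proj(text_feat)], dim=1
        )

class AttentionMerger(nn.Module):
    """Attention-based guidance: a global description token attends over
    all local pair tokens to build the condition for a denoising stage."""
    def __init__(self, d=D, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, global_tok, pair_toks):
        merged, _ = self.attn(global_tok, pair_toks, pair_toks)
        return merged  # would feed the diffusion model's cross-attention

# Toy usage: three localized (sketch, text) pairs plus one global token.
enc, merge = PairEncoder(), AttentionMerger()
pairs = torch.cat(
    [enc(torch.randn(1, 32), torch.randn(1, 48)) for _ in range(3)], dim=1
)  # shape (1, 6, D): two tokens per pair, pairs encoded independently
global_tok = torch.randn(1, 1, D)  # stand-in for an encoded global description
cond = merge(global_tok, pairs)    # shape (1, 1, D): per-stage condition
print(cond.shape)
```

In this reading, the step-wise merging strategy amounts to re-running the attention-based merge at each denoising stage rather than fusing all conditions once up front, which is what lets local and global information stay distinct until generation time.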