SPIE is a novel semantic and structural post-training method for instruction-based image editing diffusion models. To address two key challenges, alignment with the user's prompt and consistency with the input image, we present an online reinforcement learning framework that aligns diffusion models with human preferences without large datasets or extensive human annotation. SPIE leverages visual prompts to control fine-grained edits, performing accurate and structurally consistent modifications even in complex scenes while preserving fidelity in regions unrelated to the instruction, which significantly improves both instruction alignment and realism. Training requires only five reference images depicting the target concept, and after as few as 10 training rounds SPIE can perform sophisticated edits in complex scenes. It also shows potential in robotics, where improving the visual realism of simulated environments increases their utility as a proxy for the real world.
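To make the training recipe concrete, below is a minimal sketch of one plausible form of such an online preference-alignment loop: reward-weighted regression over noisy edit rollouts, using five reference exemplars and ten rounds as quoted above. Everything here is an illustrative assumption rather than the paper's actual design: `EditPolicy` stands in for the editing diffusion model, the `reward` function for the alignment-plus-consistency preference signal, and the noise scale and softmax temperature are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class EditPolicy(nn.Module):
    """Stand-in for an instruction-conditioned image-editing model.

    Plain vectors play the role of images and instructions here; a real
    system would use a diffusion backbone with visual-prompt conditioning.
    """
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, image: torch.Tensor, instruction: torch.Tensor) -> torch.Tensor:
        # Condition the edit on both the input image and the instruction.
        return self.net(torch.cat([image, instruction], dim=-1))

def reward(edited, source, instruction):
    # Hypothetical reward combining instruction alignment with
    # consistency to the source image (fidelity outside the edit).
    alignment = -(edited - instruction).pow(2).mean(dim=-1)
    consistency = -(edited - source).pow(2).mean(dim=-1)
    return alignment + 0.5 * consistency

torch.manual_seed(0)
policy = EditPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

refs = torch.randn(5, 64)  # five reference exemplars, as in the abstract

for round_idx in range(10):  # ten training rounds, as in the abstract
    instr = torch.randn(5, 64)  # sampled editing instructions
    with torch.no_grad():
        # Online rollout: perturb the current policy's edits to explore,
        # then score each candidate with the reward model.
        samples = policy(refs, instr) + 0.1 * torch.randn(5, 64)
        w = torch.softmax(reward(samples, refs, instr) / 0.1, dim=0)
    # Reward-weighted regression: pull the policy toward the
    # highest-reward rollouts, with no human annotation in the loop.
    loss = (w * (policy(refs, instr) - samples).pow(2).mean(dim=-1)).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key property this sketch shares with the described framework is that supervision comes entirely from an automatically computed reward over the model's own rollouts, so no preference dataset or annotator is required.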