This paper presents a framework for applying pre-trained, large-scale latent diffusion models to high-resolution synthetic aperture radar (SAR) image generation. The approach enables controllable synthesis and the generation of rare or out-of-distribution scenes beyond the training set. Instead of training a small, task-specific model from scratch, we adapt an open-source text-to-image latent diffusion model to the SAR modality, using semantic prior information to align prompts with SAR imaging physics (side-looking geometry, slant-range projection, and coherent speckle with heavy-tailed statistics). Using a dataset of 100,000 SAR images, we compare full fine-tuning with parameter-efficient low-rank adaptation (LoRA) across the UNet diffusion backbone, the variational autoencoder (VAE), and the text encoder. The evaluation combines (i) statistical distance to the true SAR amplitude distribution, (ii) texture similarity measured with gray-level co-occurrence matrix (GLCM) descriptors, and (iii) semantic alignment assessed by a SAR-specialized CLIP model. The results show that a hybrid strategy, combining full UNet fine-tuning with LoRA adaptation of the text encoder and learned token embeddings, best preserves SAR geometry and texture while maintaining prompt fidelity. The framework supports text-based control and multimodal conditioning (e.g., segmentation maps, TerraSAR-X imagery, or optical guidance), opening new avenues for large-scale SAR scene data augmentation and the simulation of unseen scenarios in Earth observation.
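As an illustration of the evaluation criteria named above, the sketch below computes two of the three measures for a pair of real and generated SAR amplitude chips: a Wasserstein distance between amplitude distributions and a distance between GLCM texture descriptors. It is a minimal sketch, not the paper's implementation; the quantile-based quantization, the offset and orientation choices, and all function names are assumptions made for this example.

```python
# Minimal sketch of two evaluation criteria from the abstract (assumed details,
# not the paper's code): amplitude-distribution distance and GLCM texture distance.
import numpy as np
from scipy.stats import wasserstein_distance
from skimage.feature import graycomatrix, graycoprops


def amplitude_distance(real: np.ndarray, generated: np.ndarray) -> float:
    """1-D Wasserstein distance between the amplitude distributions of two SAR chips."""
    return wasserstein_distance(real.ravel(), generated.ravel())


def glcm_features(chip: np.ndarray, levels: int = 64) -> np.ndarray:
    """GLCM descriptor: contrast, homogeneity, energy, and correlation,
    averaged over four orientations at a one-pixel offset."""
    # Quantize amplitudes into `levels` gray levels (quantile bins are an assumption).
    cuts = np.quantile(chip, np.linspace(0.0, 1.0, levels + 1)[1:-1])
    q = np.digitize(chip, cuts).astype(np.uint8)
    glcm = graycomatrix(
        q,
        distances=[1],
        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
        levels=levels,
        symmetric=True,
        normed=True,
    )
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.array([graycoprops(glcm, p).mean() for p in props])


def texture_distance(real: np.ndarray, generated: np.ndarray) -> float:
    """Euclidean distance between GLCM feature vectors (lower means more similar texture)."""
    return float(np.linalg.norm(glcm_features(real) - glcm_features(generated)))
```

In practice these per-chip scores would be aggregated over the evaluation set; the third criterion (semantic alignment with a SAR-specialized CLIP model) is omitted here because it depends on the specific encoder used in the paper.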