This paper systematically evaluates five methods for improving prompt robustness in large language models (LLMs), addressing their vulnerability to subtle, semantically irrelevant prompt changes (e.g., punctuation, formatting). We benchmark eight models from the Llama, Qwen, and Gemma families on 52 tasks from the Natural Instructions dataset, evaluate each method under both fine-tuning and in-context learning paradigms, and test how well the resulting robustness generalizes across several types of distribution shift. We further extend the analysis to GPT-4.1 and DeepSeek V3 to assess how robust state-of-the-art models are to format changes. Our results provide actionable insights into the relative effectiveness of these robustness methods, enabling practitioners to make informed decisions when reliable and stable LLM performance is required in real-world applications.
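As a rough illustration of the kind of evaluation described above, the sketch below applies meaning-preserving format perturbations (punctuation and whitespace changes) to a prompt and measures how often a model's answer stays the same. This is a minimal, hypothetical example, not the paper's actual protocol: the `perturb_format` and `consistency` helpers and the placeholder `model_fn` are assumptions introduced here for illustration.

```python
import random

# Hypothetical sketch (not the paper's protocol): apply semantically
# irrelevant format perturbations to a prompt and check whether the
# model's answer changes.

def perturb_format(prompt: str, seed: int = 0) -> str:
    """Apply one random, meaning-preserving format perturbation."""
    rng = random.Random(seed)
    perturbations = [
        lambda p: p.replace(":", " -"),      # swap separator punctuation
        lambda p: p.replace("\n", "\n\n"),   # double the line spacing
        lambda p: "  " + p,                  # add leading whitespace
        lambda p: p.rstrip() + " ",          # add trailing whitespace
    ]
    return rng.choice(perturbations)(prompt)


def consistency(model_fn, prompt: str, n_variants: int = 5) -> float:
    """Fraction of perturbed prompts whose answer matches the original.

    `model_fn` is a placeholder for any text-in/text-out model call.
    """
    reference = model_fn(prompt)
    matches = sum(
        model_fn(perturb_format(prompt, seed=i)) == reference
        for i in range(n_variants)
    )
    return matches / n_variants
```

A consistency score well below 1.0 under such perturbations would indicate the format sensitivity that the evaluated robustness methods aim to reduce.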