This paper proposes a "generalization stress test" to evaluate the generalization ability of large language models (LLMs). The test applies minor, controlled, content-preserving modifications to benchmark questions along three axes: option length, question type, and irrelevant noun substitution. Experimental results reveal that, despite high benchmark scores, LLMs suffer substantial accuracy degradation and exhibit unexpected biases (e.g., a preference for longer incorrect answers) under these modifications. For example, merely changing option lengths, while the question itself remains unchanged, shifts the MMLU score of Qwen 2.5 1.5B from 60 up to 89 in one setting and from 89 down to 36 in another. Even GPT-4 loses 25 points of accuracy when the question type changes, and declines by 6 points across all three modification categories. This analysis suggests that LLMs rely heavily on superficial cues rather than forming robust, abstract representations that generalize across format changes, lexical variation, and irrelevant content.
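To make the three perturbation families concrete, the sketch below shows one way such content-preserving modifications could be applied to an MMLU-style multiple-choice item. The item format, the padding text, the open-ended rephrasing, and the noun substitutions are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import random

# Hypothetical perturbation sketch; field names ("question", "options", "answer"),
# padding text, and substitution pairs are assumptions, not the paper's code.

def perturb_option_length(item, pad=" (stated in a deliberately verbose way)"):
    """Lengthen one randomly chosen incorrect option without changing its meaning."""
    options = list(item["options"])
    wrong_idxs = [i for i in range(len(options)) if i != item["answer"]]
    idx = random.choice(wrong_idxs)
    options[idx] = options[idx] + pad
    return {**item, "options": options}

def perturb_question_type(item):
    """Change only the answer format: ask for the choice to be written out in full."""
    question = item["question"] + " Write out the correct choice in full instead of giving a letter."
    return {**item, "question": question}

def perturb_irrelevant_nouns(item, substitutions={"Alice": "Carol", "Bob": "David"}):
    """Swap nouns that do not affect the reasoning needed to answer."""
    question = item["question"]
    for old, new in substitutions.items():
        question = question.replace(old, new)
    return {**item, "question": question}

if __name__ == "__main__":
    item = {
        "question": "Alice has 3 apples and Bob gives her 2 more. How many apples does Alice have?",
        "options": ["4", "5", "6", "7"],
        "answer": 1,  # index of the correct option ("5")
    }
    for fn in (perturb_option_length, perturb_question_type, perturb_irrelevant_nouns):
        print(fn.__name__, "->", fn(item))
```

Under the setup described in the abstract, a robust model's accuracy should be essentially unchanged by any of these edits, since the information needed to answer is preserved; the reported score swings indicate reliance on surface form instead.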