Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements

Created by
  • Haebom

Authors

Guangxiang Zhao, Saier Hu, Xiaoqi Jian, Jinzhu Wu, Yuhan Wu, Change Jia, Lin Sun, Xiangzheng Zhang

Outline

This paper proposes a "generalization stress test" to evaluate the generalization ability of large language models (LLMs). The test applies minor, controlled changes to option length, question type, and irrelevant noun substitutions. Experimental results reveal that despite high benchmark scores, LLMs exhibit significant accuracy degradation and unexpected biases (e.g., a preference for longer incorrect answers) when faced with these minor, content-preserving modifications. For example, the MMLU score of Qwen 2.5 1.5B rises from 60 to 89, or drops from 89 to 36, when option lengths are changed while the questions themselves remain unchanged. Even GPT-4 suffers a 25-point accuracy loss when question types are changed, with a 6-point drop across all three modification categories. The analysis suggests that LLMs rely heavily on superficial cues rather than forming robust, abstract representations that generalize across formats, lexical variations, and irrelevant content.
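To make the idea concrete, the sketch below shows how such content-preserving perturbations might be applied to a multiple-choice item and how robustness could be measured. This is a minimal illustration, not the authors' implementation: query_model, the filler string, and the noun mapping are all hypothetical placeholders for whatever model API and perturbation lists one actually uses.

    # Minimal sketch of a "generalization stress test" in the spirit of the paper.
    # The specific perturbations (padding a distractor, renaming an irrelevant
    # noun) are illustrative assumptions; `query_model` is a hypothetical
    # callable that takes (question, options) and returns a chosen option index.

    import re

    def pad_option(options: list[str], idx: int, filler: str = ", broadly speaking") -> list[str]:
        """Lengthen one (incorrect) option without changing its meaning."""
        padded = options.copy()
        padded[idx] = padded[idx] + filler
        return padded

    def replace_nouns(question: str, mapping: dict[str, str]) -> str:
        """Swap task-irrelevant nouns (e.g., person names) in the question."""
        for old, new in mapping.items():
            question = re.sub(rf"\b{re.escape(old)}\b", new, question)
        return question

    def stress_test(question: str, options: list[str], answer_idx: int, query_model) -> dict:
        """Compare the model's answer on the original vs. perturbed item."""
        results = {"baseline_correct": query_model(question, options) == answer_idx}
        # Perturbation 1: lengthen a single incorrect option only.
        wrong_idx = (answer_idx + 1) % len(options)
        results["robust_to_option_length"] = (
            query_model(question, pad_option(options, wrong_idx)) == answer_idx
        )
        # Perturbation 2: replace an irrelevant noun (hypothetical mapping).
        renamed = replace_nouns(question, {"Alice": "Karen"})
        results["robust_to_noun_swap"] = query_model(renamed, options) == answer_idx
        return results

A model that forms abstract representations should keep its answer under both perturbations; aggregating these flags over a benchmark yields the accuracy drops the paper reports.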

Takeaways, Limitations

Takeaways:
High benchmark scores of LLMs may not reflect actual generalization ability.
LLMs appear to rely on superficial cues and pattern matching rather than genuine understanding.
The "generalization stress test" offers a new methodology for assessing the generalization ability of LLMs.
The results underscore the importance of improving generalization in LLM development.
Limitations:
Further research is needed on the generalizability and scalability of the proposed "generalization stress test."
The types and intensities of the perturbations used in the tests may be limited.
Because the results are obtained on specific LLMs and datasets, caution is needed when generalizing them to other models or datasets.