Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Beyond Memorization: Assessing Semantic Generalization in Large Language Models Using Phrasal Constructions

Created by
  • Haebom

Author

Wesley Scivetti, Melissa Torgbi, Austin Blodgett, Mollie Shichman, Taylor Hudson, Claire Bonial, Harish Tayyar Madabushi

Outline

This paper presents a diagnostic evaluation based on Construction Grammar (CxG) to address a challenge posed by large pre-training datasets: distinguishing linguistic abilities that are well represented in the pre-training data from genuine generalization to novel, real-world instances that are rare in that data. CxG provides a psycholinguistically grounded framework for testing generalization because it explicitly pairs syntactic forms with abstract, non-lexical meanings. We construct a novel inference evaluation dataset using English phrasal constructions, capitalizing on speakers' ability to abstract away from common exemplars in order to understand and produce creative examples. The dataset targets two central questions: whether models can "understand" the meaning of sentences that are rare in the pre-training data yet intuitive and easily understood by humans, and whether models can appropriately use constructional meaning when given syntactically identical but semantically distinct constructions. State-of-the-art models, including GPT-o1, underperform by over 40% on the second task, showing that, unlike humans, they fail to generalize from syntactically identical forms to their distinct constructional meanings. We are making the new dataset and the associated experimental data (including prompts and model responses) publicly available.
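To make the evaluation setup more concrete, below is a minimal, hypothetical sketch of what a construction-based inference item and scoring loop might look like. The example sentences, the query_model placeholder, and the scoring logic are illustrative assumptions for this summary; they are not the paper's released dataset or code.

```python
# Hypothetical sketch: scoring entailment judgments that depend on
# constructional (structural) meaning rather than on individual words.
# Items, labels, and query_model() are illustrative assumptions only.

from dataclasses import dataclass

@dataclass
class InferenceItem:
    premise: str      # sentence instantiating a phrasal construction
    hypothesis: str   # follows only if the constructional meaning is grasped
    label: str        # "entailment" or "not-entailment"

items = [
    # Caused-motion construction with a creative, low-frequency verb use.
    InferenceItem(
        premise="She sneezed the napkin off the table.",
        hypothesis="The napkin moved off the table.",
        label="entailment",
    ),
    # Superficially similar sentence without the constructional meaning.
    InferenceItem(
        premise="She sneezed near the napkin on the table.",
        hypothesis="The napkin moved off the table.",
        label="not-entailment",
    ),
]

def query_model(prompt: str) -> str:
    """Dummy stand-in that always answers 'yes'; replace with a real LLM call."""
    return "yes"

def score(test_items) -> float:
    """Return accuracy of the model's yes/no entailment answers."""
    correct = 0
    for item in test_items:
        prompt = (
            f"Premise: {item.premise}\n"
            f"Hypothesis: {item.hypothesis}\n"
            "Does the premise entail the hypothesis? Answer yes or no."
        )
        answer = query_model(prompt).strip().lower()
        predicted = "entailment" if answer.startswith("yes") else "not-entailment"
        correct += predicted == item.label
    return correct / len(test_items)

if __name__ == "__main__":
    print(f"accuracy: {score(items):.2f}")
```

In this kind of setup, a model that relies only on lexical cues answers both items the same way, so contrasting minimally different items is what exposes whether the constructional meaning itself is being used.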

Takeaways, Limitations

Takeaways:
  • Provides a deeper understanding of the generalization abilities of large language models (LLMs).
  • Presents a new evaluation framework based on Construction Grammar (CxG).
  • Publicly releases a new dataset that clearly demonstrates the limitations of LLMs.
  • Contributes to analyzing how biases in pre-training data affect LLM performance.
Limitations:
  • The evaluation dataset focuses solely on English phrasal constructions, which may limit generalizability to other languages or construction types.
  • The evaluation relies heavily on the CxG framework, so its interpretation may differ under other theoretical perspectives.
  • The set of models evaluated is limited; further experiments with a wider range of models are needed.