This paper presents a diagnostic assessment based on Construction Grammar (CxG) that addresses a central evaluation challenge posed by large pre-training datasets: distinguishing linguistic abilities that merely reflect patterns well represented in the pre-training data from genuine generalization to dynamic, real-world instances that appear there less often. CxG provides a psycholinguistically grounded framework for testing generalization because it explicitly links syntactic forms to abstract, non-lexical meanings. We construct a novel inference assessment dataset built on English phrase structures, capitalizing on speakers' ability to abstract from common exemplars in order to understand and produce creative examples. The dataset targets two central questions: whether models can "understand" the meaning of sentences that are rare in the pre-training data yet intuitive and easily understood by humans, and whether models can appropriately use structural meaning when confronted with syntactically identical but semantically distinct structures. State-of-the-art models, including GPT-o1, underperform by over 40% on the second task, showing that, unlike humans, they fail to generalize syntactically identical forms to their distinct structural meanings. We publicly release the new dataset and the associated experimental data, including prompts and model responses.