This paper addresses the issue that the benchmark performance of large language models (LLMs) may reflect overreliance on dataset-specific, surface-level cues. To detect this, we propose a meta-evaluation framework, the Chameleon Benchmark Overfitting Detector (C-BOD), which systematically distorts benchmark prompts via parametric transformations and measures the resulting change in performance. Evaluating 26 leading LLMs on the MMLU benchmark, we find that even modest distortions cause an average performance degradation of 2.15%, with statistically significant drops in 20 of the 26 models. Models with higher baseline accuracy exhibit larger degradation under transformation, and larger LLMs tend to be more sensitive to rephrasing, suggesting that these models may rely excessively on fixed prompt patterns. In contrast, the Llama family and models with lower baseline accuracy exhibit minimal degradation, indicating less dependence on surface-level cues. C-BOD's dataset- and model-agnostic design allows easy integration into training pipelines and promotes more robust language understanding.
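
The sketch below illustrates the core comparison such a framework implies, under stated assumptions: `rephrase` is a hypothetical stand-in for the parametric prompt transformation described above (e.g., a paraphrase at some distortion level), `is_correct` is a hypothetical wrapper that runs the model on one prompt and scores its answer, and McNemar's exact test on discordant pairs is one plausible choice of paired significance test; the paper's exact procedure may differ.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

from scipy.stats import binomtest


@dataclass
class CBODResult:
    original_acc: float
    perturbed_acc: float
    degradation: float   # original_acc - perturbed_acc
    p_value: float       # exact McNemar-style test on discordant pairs


def cbod_evaluate(
    is_correct: Callable[[str], bool],   # hypothetical: run the model on one prompt, score the answer
    prompts: Sequence[str],
    rephrase: Callable[[str], str],      # hypothetical: parametric distortion of a prompt
) -> CBODResult:
    """Compare a model's accuracy on original vs. rephrased benchmark prompts."""
    orig = [is_correct(p) for p in prompts]
    pert = [is_correct(rephrase(p)) for p in prompts]

    # Discordant pairs: correct on one prompt variant but not the other.
    b = sum(o and not q for o, q in zip(orig, pert))  # right originally, wrong after rephrasing
    c = sum(q and not o for o, q in zip(orig, pert))  # wrong originally, right after rephrasing

    # Under the null hypothesis (no sensitivity to surface form),
    # discordant outcomes split 50/50; test this with an exact binomial test.
    n_discordant = b + c
    p_value = binomtest(b, n_discordant, 0.5).pvalue if n_discordant > 0 else 1.0

    n = len(prompts)
    orig_acc = sum(orig) / n
    pert_acc = sum(pert) / n
    return CBODResult(orig_acc, pert_acc, orig_acc - pert_acc, p_value)
```

In this reading, a large positive degradation paired with a small p-value flags a model whose benchmark score depends on the surface form of the prompts rather than their content, which is the overfitting signal the framework is designed to surface.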