This paper presents the first dynamic benchmarking framework for assessing data-induced cognitive biases in general-purpose AI (GPAI) systems within software engineering workflows. Starting with 16 handcrafted, realistic tasks (each featuring one of eight cognitive biases), we test whether bias-inducing linguistic cues unrelated to the task logic can lead GPAI systems to incorrect conclusions. We develop an on-demand augmentation pipeline that alters superficial task details while preserving the bias-inducing cues, thereby scaling the benchmark while maintaining realism. The pipeline ensures correctness, promotes diversity, and controls inference complexity by leveraging Prolog-based inference and LLM-as-a-judge verification. Evaluating leading GPAI systems, including GPT, LLaMA, and DeepSeek, we find a consistent tendency to rely on shallow linguistic heuristics rather than deep reasoning. All systems exhibit cognitive bias (with rates ranging from 5.9% to 35%, depending on the bias type), and bias sensitivity rises sharply with task complexity (reaching up to 49%), highlighting a significant risk in real-world software engineering deployments.