Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Is General-Purpose AI Reasoning Sensitive to Data-Induced Cognitive Biases? Dynamic Benchmarking on Typical Software Engineering Dilemmas

Created by
  • Haebom

Authors

Francesco Sovrano, Gabriele Dominici, Rita Sevastjanova, Alessandra Stramiglio, Alberto Bacchelli

Outline

This paper presents the first dynamic benchmarking framework for assessing data-induced cognitive biases in general-purpose AI (GPAI) systems within software engineering workflows. Starting with 16 handcrafted, realistic tasks (each featuring one of eight cognitive biases), the authors test whether bias-inducing linguistic cues unrelated to the task logic can lead GPAI systems to incorrect conclusions. They develop an on-demand augmentation pipeline that alters superficial details while preserving the bias-inducing cues, thereby scaling the benchmark while maintaining realism. The pipeline ensures correctness, promotes diversity, and controls inference complexity by leveraging Prolog-based inference and LLM-as-a-judge verification. Evaluating leading GPAI systems, including GPT, LLaMA, and DeepSeek, the authors find a consistent tendency to rely on shallow linguistic heuristics rather than deep reasoning. All systems exhibited cognitive bias (with rates from 5.9% to 35% depending on bias type), and bias sensitivity increased sharply with task complexity (up to 49%), highlighting a significant risk for real-world software engineering deployments.
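The core idea of the benchmark can be illustrated with a minimal sketch (not the authors' code): a bias-inducing cue that is logically irrelevant to the task is prepended to a task prompt, so the ground-truth answer stays the same while the surface text changes. The bias names and cue texts below are hypothetical examples chosen for illustration.

```python
# Illustrative sketch, assuming hypothetical cue texts; this is NOT the
# paper's augmentation pipeline, which also uses Prolog-based inference
# and LLM-as-a-judge verification for correctness checking.

BIAS_CUES = {
    # Cues are deliberately irrelevant to the task's logic.
    "anchoring": "A senior engineer already estimated this module is correct. ",
    "framing": "Treat this review as avoiding a loss, not gaining a feature. ",
}

def augment_task(task_text: str, bias: str) -> str:
    """Prepend a logically irrelevant, bias-inducing cue to a task prompt.

    The underlying task (and its ground-truth answer) is unchanged;
    only the surrounding linguistic context differs.
    """
    return BIAS_CUES[bias] + task_text

task = "Does function f(x) return x + 1 for every integer input x?"
biased = augment_task(task, "anchoring")
print(biased)
```

A system that reasons from the task logic should answer the original and augmented prompts identically; a drop in accuracy on the augmented variants is then attributable to the injected cue.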

Takeaways, Limitations

Takeaways:
  • Presents the first benchmarking framework for quantitatively measuring data-induced cognitive bias in GPAI systems applied to software engineering.
  • Demonstrates that leading GPAI systems exhibit cognitive biases and become more susceptible to bias as task complexity increases.
  • Confirms that GPAI systems rely on shallow linguistic heuristics, underscoring the need for deeper reasoning.
  • Warns of the risks of cognitive bias when GPAI systems are deployed in real-world software engineering.
Limitations:
  • The current framework focuses on specific types of cognitive biases and software engineering tasks; further research is needed to generalize to other domains and bias types.
  • The accuracy assessment of the on-demand augmentation pipeline relies on human evaluation, which is potentially subjective and limited.
  • Only a limited set of GPAI systems was evaluated; a broader range of systems should be assessed.