This paper critically examines the claim that large language models (LLMs), such as ChatGPT, can replace human participants in psychological research. We present conceptual arguments against the hypothesis that LLMs simulate human psychology and provide empirical evidence from several LLMs, including the CENTAUR model, which is specifically fine-tuned on psychological responses. We show that substantial differences between LLM and human responses arise when subtle changes in wording lead to large shifts in meaning, and that different LLMs give markedly different answers to novel items, underscoring their unreliability. We conclude that LLMs do not simulate human psychology and that psychological researchers should treat LLMs as useful but fundamentally unreliable tools that require validation against human responses in every new application.