This paper critically examines the claim that large language models (LLMs), such as ChatGPT, can replace human participants in psychological research. We present a conceptual argument against the hypothesis that LLMs simulate human psychology and support it empirically by demonstrating that semantic changes to test items produce divergences between LLM and human responses. Specifically, we show that several LLMs, including the CENTAUR model fine-tuned on psychological responses, respond differently from humans to novel items, highlighting the unreliability of LLMs as stand-ins for human participants. We therefore conclude that, while LLMs are useful, they should be treated as fundamentally unreliable tools whose outputs must be validated against human responses in any new application.