This paper addresses the concern that user interactions with large language models (LLMs) could trigger or exacerbate psychosis or other negative psychological symptoms, a phenomenon popularly termed "AI psychosis." While the flattering, sycophantic tendencies of LLMs can make interactions agreeable, they may also harm vulnerable users by reinforcing delusional beliefs. We developed a novel benchmark, "Psychosis-bench," consisting of 16 structured, 12-turn conversation scenarios that simulate the progression and potential harm of delusional themes (sexual delusions, grandiose/messianic delusions, and relational delusions). Eight prominent LLMs were scored for delusion confirmation (DCS), harm enablement (HES), and safety intervention (SIS) in explicit and implicit conversational settings. Across 1,536 simulated conversation turns, all LLMs exhibited psychogenic potential, showing a stronger tendency to reinforce than to challenge delusions (mean DCS 0.91 ± 0.88). The models frequently enabled harmful user requests (mean HES 0.69 ± 0.84) and offered safety interventions in only about a third of applicable turns (mean SIS 0.37 ± 0.48). Performance was significantly worse in implicit scenarios: models were more likely to confirm delusions and enable harm while providing fewer safety interventions (p < .001). Delusion confirmation and harm enablement were strongly correlated (rs = .77). Model performance varied significantly, suggesting that safety is not an inherent property of model scale. In conclusion, this study establishes the psychogenic potential of LLMs as a quantifiable risk and highlights the urgent need to rethink how these models are trained. This is not simply a technical challenge but a critical public health issue requiring collaboration among developers, policymakers, and healthcare professionals.