This paper systematically investigates the performance, scalability, and limitations of Large Reasoning Models (LRMs). Unlike prior work that focuses mainly on final-answer accuracy, this study analyzes not only the final answer but also the internal reasoning traces, using controllable puzzle environments whose complexity can be adjusted precisely. The experiments show that LRMs suffer a complete accuracy collapse beyond a certain complexity threshold, and that reasoning effort rises with problem complexity only up to a point before declining even though token budget remains, revealing a counterintuitive scaling limit. In addition, by comparing LRMs with standard LLMs, the paper identifies three performance regimes across low-, medium-, and high-complexity tasks, and exposes the limitations of LRMs in exact computation and consistent reasoning. Through analysis of the reasoning traces, the study examines the models' solution-search patterns and computational behavior, raising questions about the strengths, limitations, and true reasoning capabilities of LRMs.
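For concreteness, the sketch below illustrates what such a controllable puzzle environment could look like. It uses Tower of Hanoi, a classic puzzle whose difficulty is set by a single knob (the number of disks, with minimum solution length 2^N - 1), and a move-level simulator that can validate every intermediate step of a model's trace, not just the final answer. The names `HanoiEnv` and `validate_trace` are illustrative assumptions, not the paper's actual code.

```python
# A minimal sketch (assumed interface, not the paper's codebase) of a
# controllable puzzle environment: Tower of Hanoi, where the number of
# disks N is the single complexity knob. A move-level simulator like this
# lets an evaluator check every intermediate step of a reasoning trace.

from typing import List, Tuple

Move = Tuple[int, int]  # (source peg, target peg), pegs indexed 0..2


class HanoiEnv:
    """Tower of Hanoi with N disks; the optimal solution has 2**N - 1 moves."""

    def __init__(self, num_disks: int):
        self.num_disks = num_disks
        # Peg 0 holds all disks, largest (N) at the bottom, smallest (1) on top.
        self.pegs: List[List[int]] = [list(range(num_disks, 0, -1)), [], []]

    def apply(self, move: Move) -> bool:
        """Apply one move; return False if it is illegal."""
        src, dst = move
        if not self.pegs[src]:
            return False  # nothing to move from the source peg
        disk = self.pegs[src][-1]
        if self.pegs[dst] and self.pegs[dst][-1] < disk:
            return False  # cannot place a larger disk on a smaller one
        self.pegs[dst].append(self.pegs[src].pop())
        return True

    def solved(self) -> bool:
        return len(self.pegs[2]) == self.num_disks


def validate_trace(num_disks: int, moves: List[Move]) -> Tuple[bool, int]:
    """Replay a model-proposed move sequence.

    Returns (solved, index of first invalid move, or -1 if all moves are legal).
    """
    env = HanoiEnv(num_disks)
    for i, move in enumerate(moves):
        if not env.apply(move):
            return False, i  # exact step where the trace breaks down
    return env.solved(), -1


if __name__ == "__main__":
    # Optimal 3-disk solution (7 moves); raising num_disks scales complexity.
    solution = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
    print(validate_trace(3, solution))  # (True, -1)
```

Because the simulator flags the exact step at which a trace becomes invalid, an evaluation built on it can distinguish a model that reasons correctly throughout from one that merely guesses a plausible final configuration.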