In this paper, we propose ThinkPRM, a process reward model (PRM) that verifies solutions by generating a verification chain-of-thought (CoT) with a language model, improving the data efficiency of process verifiers. ThinkPRM is trained on far fewer process labels (about 1%) than existing discriminative PRMs, yet outperforms prior methods on several benchmarks, including ProcessBench, MATH-500, and AIME '24, by leveraging the reasoning ability of the underlying long-CoT model. It also surpasses existing PRMs in out-of-domain evaluations on subsets of GPQA-Diamond and LiveCodeBench, and scales verification compute more efficiently than LLM-as-a-Judge under the same token budget. Overall, ThinkPRM demonstrates the value of generative, long-CoT PRMs that can scale test-time verification compute while requiring minimal supervision for training.
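To make the verification procedure concrete, the sketch below illustrates how a generative PRM of this kind might score a candidate solution: the verifier model is prompted to reason about each step in a long chain-of-thought and to emit a per-step correct/incorrect verdict, which is then parsed into step scores. This is a minimal sketch under assumed conventions; the `generate` callable, the prompt wording, and the `Verdict:` format are illustrative placeholders rather than the paper's exact implementation.

```python
import re
from typing import Callable, List

def score_solution(
    problem: str,
    steps: List[str],
    generate: Callable[[str], str],  # hypothetical LM call: prompt -> generated text
) -> List[float]:
    """Score each solution step with a generative (chain-of-thought) verifier.

    The verifier is asked to think through every step and end each step's
    analysis with a verdict line; those verdicts are parsed into 0/1 scores.
    """
    numbered = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    prompt = (
        "You are a careful verifier. Review the solution step by step.\n"
        f"Problem:\n{problem}\n\nSolution:\n{numbered}\n\n"
        "For each step, explain your reasoning, then end that step's analysis "
        "with a line of the form 'Verdict: correct' or 'Verdict: incorrect'."
    )
    verification_cot = generate(prompt)  # the long verification chain-of-thought

    # Parse one verdict per step; any missing verdict defaults to incorrect (0.0).
    verdicts = re.findall(r"Verdict:\s*(correct|incorrect)", verification_cot, re.IGNORECASE)
    scores = [1.0 if v.lower() == "correct" else 0.0 for v in verdicts]
    scores += [0.0] * (len(steps) - len(scores))
    return scores[: len(steps)]

def solution_score(step_scores: List[float]) -> float:
    # A solution-level score for reranking: a solution is only as good
    # as its weakest verified step.
    return min(step_scores) if step_scores else 0.0
```

In a best-of-N setting, candidate solutions could be reranked by `solution_score`, and verification compute could be scaled further by sampling multiple verification CoTs per solution and aggregating their step scores.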