This paper explores the use of large language models (LLMs) as evaluators. Existing supervised fine-tuning (SFT) approaches fall short on judgment tasks that require complex reasoning, so we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Our results reveal a negative correlation between SFT performance gains and the proportion of samples that demand deep reasoning. To overcome this limitation, we propose JudgeLRM, a family of judgment-oriented LLMs trained with reinforcement learning (RL) using judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned baselines and state-of-the-art reasoning models, particularly on judgment tasks requiring deep reasoning: JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score.