Large language models (LLMs) often generate hallucinations, i.e., unsupported content that undermines reliability. While most prior work frames hallucination detection as a binary classification task, real-world applications require identifying hallucination spans, which demands a multi-step decision-making process. To this end, we evaluate pretrained reasoning models with Chain-of-Thought (CoT) reasoning and observe that, over multiple samplings, CoT reasoning can produce at least one correct answer. Motivated by this, we propose RL4HS, a reinforcement learning framework that incentivizes reasoning through a span-level reward function. RL4HS builds on Group Relative Policy Optimization (GRPO) and introduces Class-Aware Policy Optimization to mitigate the reward imbalance problem. Experiments on the RAGTruth benchmark (summarization, question answering, and data-to-text writing) show that RL4HS surpasses pretrained reasoning models and supervised fine-tuning, demonstrating the importance of reinforcement learning with span-level rewards for detecting hallucination spans.
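The two mechanisms named above are a span-level reward and a class-aware variant of GRPO. The sketch below is a minimal illustration, not the paper's implementation: it assumes a character-overlap span-F1 reward and a hypothetical per-class weight applied to group-relative advantages; all function names and the exact weighting scheme are assumptions made for illustration.

```python
# Illustrative sketch only: a span-overlap reward and a class-aware,
# group-relative advantage in the spirit of GRPO. The reward definition
# and per-class weights are assumptions, not the paper's actual code.
from typing import List, Tuple

Span = Tuple[int, int]  # [start, end) character offsets into the response


def _covered(spans: List[Span]) -> set:
    """Set of character positions covered by a list of spans."""
    chars = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def span_reward(pred: List[Span], gold: List[Span]) -> float:
    """Character-level span F1 between predicted and gold hallucination spans.
    Returns 1.0 when both are empty (correctly predicting 'no hallucination')."""
    p, g = _covered(pred), _covered(gold)
    if not p and not g:
        return 1.0
    if not p or not g:
        return 0.0
    overlap = len(p & g)
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def class_aware_advantages(rewards: List[float],
                           is_hallucinated: bool,
                           pos_weight: float = 1.0,
                           neg_weight: float = 0.5) -> List[float]:
    """Group-relative advantages (reward minus group mean, divided by std),
    scaled by a hypothetical per-class weight to soften the reward imbalance
    between hallucinated and non-hallucinated examples."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero for identical rewards
    weight = pos_weight if is_hallucinated else neg_weight
    return [weight * (r - mean) / std for r in rewards]


if __name__ == "__main__":
    gold = [(10, 25)]
    sampled_preds = [[(10, 25)], [(12, 30)], []]  # e.g., multiple CoT samples
    rewards = [span_reward(p, gold) for p in sampled_preds]
    print(rewards, class_aware_advantages(rewards, is_hallucinated=True))
```

In this sketch, the span-level reward scores each sampled CoT prediction against the gold spans, and the group-relative normalization mirrors how GRPO compares samples within a group; the per-class weight is one simple way the reward imbalance between hallucinated and non-hallucinated cases could be addressed.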