This paper highlights challenges in evaluating hallucination detection methods for large language models (LLMs). Existing evaluations of hallucination detection rely on lexical overlap-based metrics such as ROUGE to judge response correctness, and these metrics are inconsistent with human judgment and therefore error-prone. Through human studies, the researchers show that while ROUGE achieves high recall, its precision is very low, which inflates the measured performance of detection methods. When the same methods are evaluated against human-aligned judgments such as LLM-as-a-Judge, their performance drops by up to 45.9%. The authors also find that simple heuristics, such as response length, perform comparably to sophisticated detection techniques. They therefore argue that a robust, semantics-aware evaluation framework that accurately measures the performance of hallucination detection methods is essential for ensuring the reliability of LLM outputs.
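To make the precision/recall finding concrete, the sketch below (a hypothetical illustration, not the paper's evaluation code) thresholds ROUGE-L F1 from the rouge-score package to label responses as "correct" and then scores those automatic labels against human judgments; the examples, the 0.5 threshold, and all field names are invented for illustration.

```python
# Minimal sketch (not the paper's code): treat "ROUGE-L F1 above a threshold"
# as "the response is correct" and measure how well that automatic label
# agrees with human judgment. Examples, threshold, and field names are
# hypothetical placeholders. Requires: pip install rouge-score
from rouge_score import rouge_scorer

# Hypothetical QA-style examples: model response, reference answer, and a
# human judgment of whether the response is actually correct.
examples = [
    {"response": "The capital of France is Paris.",
     "reference": "Paris is the capital of France.", "human_correct": True},
    {"response": "The capital of France is Lyon.",   # wrong answer, high lexical overlap
     "reference": "Paris is the capital of France.", "human_correct": False},
    {"response": "I am not sure about that.",        # wrong answer, low lexical overlap
     "reference": "Paris is the capital of France.", "human_correct": False},
]

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
ROUGE_THRESHOLD = 0.5  # hypothetical cutoff for calling a response "correct"

tp = fp = fn = 0
for ex in examples:
    f1 = scorer.score(ex["reference"], ex["response"])["rougeL"].fmeasure
    rouge_says_correct = f1 >= ROUGE_THRESHOLD
    if rouge_says_correct and ex["human_correct"]:
        tp += 1
    elif rouge_says_correct and not ex["human_correct"]:
        fp += 1  # lexical overlap without factual correctness -> inflated scores
    elif not rouge_says_correct and ex["human_correct"]:
        fn += 1

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(f"ROUGE-as-correctness vs. human judgment: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy setup the second example is factually wrong yet lexically close to the reference, so the ROUGE-based label counts it as correct; that is the kind of false positive that yields high recall but low precision and, in turn, overestimates how well a hallucination detector is performing.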