This paper examines how hallucinations in language models are measured. Motivated by concerns that current hallucination detection metrics are neither reliable nor generalizable, we conduct a large-scale evaluation of six hallucination detection metrics across four datasets, 37 language models, and five decoding methods. We find that existing metrics correlate poorly with human judgments, take a myopic view of the problem, and do not improve consistently as model size grows. On the positive side, LLM-based evaluation (e.g., using GPT-4) performs best among the metrics studied, and mode-exploratory decoding proves effective at reducing hallucinations.
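
To make the LLM-based evaluation setup concrete, the sketch below shows one way an LLM-as-judge hallucination check could be wired up. This is an illustrative assumption, not the paper's exact protocol: the prompt wording, the `judge_hallucination` helper, and the choice of GPT-4 through the OpenAI chat API are placeholders.

```python
# Illustrative LLM-as-judge hallucination check (a sketch, not the paper's protocol).
# Assumes the OpenAI Python client; the model name and prompt wording are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are checking a model response against a source document.\n"
    "Source:\n{source}\n\nResponse:\n{response}\n\n"
    "Does the response contain any claim not supported by the source? "
    "Answer with exactly one word: 'hallucinated' or 'faithful'."
)

def judge_hallucination(source: str, response: str, model: str = "gpt-4") -> bool:
    """Return True if the judge model labels the response as hallucinated."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic verdicts for reproducibility
        messages=[
            {"role": "user",
             "content": JUDGE_PROMPT.format(source=source, response=response)}
        ],
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return verdict.startswith("hallucinated")
```

In practice, such a judge would be run over each (source, response) pair in a benchmark and its binary verdicts compared against human annotations, which is the kind of agreement the evaluation above measures.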