This paper proposes MESH, a novel benchmark for systematically evaluating hallucinations in large video models (LVMs). To address the limitations of existing benchmarks, MESH adopts a question-answering approach that assesses basic objects, fine-grained features, and subject-action pairs in a layered manner. This design mirrors the human process of video comprehension and aims to pinpoint the causes of hallucinations in LVMs more precisely. Experimental results show that while LVMs reliably recognize basic objects and features, their hallucination rates rise sharply in scenes containing fine-grained details or complex actions involving multiple subjects.
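As an illustration only, a layered QA evaluation of this kind could be sketched as below; the paper's exact question format and scoring protocol are not specified here, and all names (`QAItem`, `ask_model`, the layer labels) are hypothetical assumptions, not MESH's actual API.

```python
# Hypothetical sketch of a layered QA hallucination evaluation in the spirit of
# a three-layer benchmark (basic objects, fine-grained features, subject-action
# pairs). Names and fields are illustrative, not taken from the paper.

from dataclasses import dataclass

@dataclass
class QAItem:
    video_id: str
    layer: str        # assumed labels: "object", "feature", or "subject_action"
    question: str     # binary (yes/no) probe about the video content
    answer: bool      # ground-truth answer

def hallucination_rates(items, ask_model):
    """Compute per-layer hallucination rates.

    `ask_model(video_id, question) -> bool` is a stand-in for querying an LVM
    with a binary question about a video. A hallucination is counted whenever
    the model's answer disagrees with the ground truth.
    """
    totals, errors = {}, {}
    for item in items:
        totals[item.layer] = totals.get(item.layer, 0) + 1
        if ask_model(item.video_id, item.question) != item.answer:
            errors[item.layer] = errors.get(item.layer, 0) + 1
    return {layer: errors.get(layer, 0) / n for layer, n in totals.items()}
```

Under this sketch, the abstract's finding would correspond to the "subject_action" layer showing a markedly higher error rate than the "object" layer.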