This paper addresses the problem of hallucinations in large language models (LLMs) when they respond to patients' healthcare questions. Unlike prior work that assesses LLMs' medical knowledge with standardized medical exam questions, this study analyzes hallucinations in LLM responses to real-world medical questions posed by patients. To this end, we introduce MedHalu, a new benchmark that spans a diverse range of health-related topics and contains LLM-generated hallucinated responses, with fine-grained annotations of both the hallucination types and the hallucinated text spans. We further propose MedHaluDetect, a comprehensive framework for evaluating LLMs' ability to detect hallucinations, and study how vulnerable three groups of evaluators, namely medical experts, LLMs, and laypeople, are to medical hallucinations. Our results show that LLMs detect hallucinations significantly worse than medical experts and, in some cases, worse than laypeople. To close this gap, we propose an expert-in-the-loop approach that integrates expert reasoning into LLM inputs, improving the hallucination detection performance of LLMs (e.g., a 6.3% improvement in macro-F1 for GPT-4).
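
To give a concrete picture of the expert-in-the-loop idea mentioned above, the sketch below shows one way expert reasoning could be injected into an LLM's hallucination-detection prompt. The data class, function names, and prompt wording are illustrative assumptions, not the paper's released implementation.

```python
# Illustrative sketch of the expert-in-the-loop setup: expert reasoning about a
# candidate answer is prepended to the detection prompt before the LLM judges
# whether the answer is hallucinated. All names and prompt wording here are
# hypothetical, not the authors' exact code.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    question: str           # patient question
    answer: str             # LLM-generated answer to be judged
    expert_reasoning: str   # free-text rationale from a medical expert


def build_detection_prompt(sample: Sample, with_expert: bool = True) -> str:
    """Compose the hallucination-detection prompt, optionally injecting
    the expert's reasoning before the judgment request."""
    parts = [
        f"Patient question:\n{sample.question}",
        f"Candidate answer:\n{sample.answer}",
    ]
    if with_expert and sample.expert_reasoning:
        parts.append(f"Expert reasoning about the answer:\n{sample.expert_reasoning}")
    parts.append(
        "Does the candidate answer contain a hallucination "
        "(content unsupported by medical knowledge or by the question)? "
        "Reply with 'yes' or 'no'."
    )
    return "\n\n".join(parts)


def detect_hallucination(sample: Sample, llm: Callable[[str], str]) -> bool:
    """`llm` is any text-in/text-out model call (e.g., a chat-completion wrapper)."""
    reply = llm(build_detection_prompt(sample))
    return reply.strip().lower().startswith("yes")
```

In this sketch, the only difference between the plain and expert-in-the-loop conditions is whether the expert's rationale is included in the prompt, which mirrors the comparison that yields the reported macro-F1 gains.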