Large language models (LLMs) have the potential to improve healthcare by supporting medical research and assisting physicians. However, their reliance on static training data poses a significant risk when medical recommendations evolve in response to new research and developments: LLMs that retain outdated medical knowledge may provide harmful advice or fail at clinical reasoning tasks. To investigate this issue, we present two novel question-answering (QA) datasets derived from systematic reviews: MedRevQA (16,501 QA pairs covering general biomedical knowledge) and MedChangeQA (a subset of 512 QA pairs for which the medical consensus has changed over time). Evaluating eight leading LLMs on these datasets reveals a consistent reliance on outdated knowledge across all models. We further analyze how obsolete pretraining data and current training strategies contribute to this phenomenon, and we propose directions for mitigation, laying the foundation for more up-to-date and reliable medical AI systems.