This paper explores the potential and limitations of automating literature reviews with large language models (LLMs). While LLMs can assist across the literature review process, including document collection, organization, and summarization, their ability to produce comprehensive and reliable reviews remains unclear. We present a framework for automatically evaluating LLM performance on three core tasks: generating references, summarizing literature, and writing literature reviews. We measure the hallucination rate of generated references and introduce a multidimensional evaluation metric that assesses the semantic coverage and factual consistency of generated summaries and reviews against human-written counterparts. Experimental results show that even state-of-the-art models, despite recent advances, still produce hallucinated references. Moreover, we find that model performance on literature review writing varies across disciplines.
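As a rough illustration of the kind of reference check the abstract describes, the sketch below computes a hallucination rate by fuzzy-matching generated reference titles against a trusted bibliographic index; the function names, the use of title-only matching, and the similarity threshold are assumptions for exposition, not the paper's actual verification procedure.

```python
# Illustrative sketch only: the paper's verification pipeline is not reproduced here.
from difflib import SequenceMatcher

def is_verifiable(generated_title: str, known_titles: list[str], threshold: float = 0.9) -> bool:
    """Treat a generated reference as verifiable if its title closely matches
    any entry in an assumed bibliographic index."""
    return any(
        SequenceMatcher(None, generated_title.lower(), t.lower()).ratio() >= threshold
        for t in known_titles
    )

def hallucination_rate(generated_titles: list[str], known_titles: list[str]) -> float:
    """Fraction of generated references that cannot be matched to a real record."""
    if not generated_titles:
        return 0.0
    unverified = sum(not is_verifiable(t, known_titles) for t in generated_titles)
    return unverified / len(generated_titles)

# Example: two of three generated references match real papers, so the rate is 1/3.
real = ["Attention Is All You Need",
        "BERT: Pre-training of Deep Bidirectional Transformers"]
generated = ["Attention Is All You Need",
             "BERT: Pre-training of Deep Bidirectional Transformers",
             "A Unified Theory of Literature Review Automation"]  # likely fabricated
print(hallucination_rate(generated, real))
```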