This paper evaluates the factual accuracy of large language models (LLMs), specifically their accuracy in generating links to arXiv articles. We evaluate a range of proprietary and open-source LLMs on arXivBench, a novel benchmark covering eight major subject areas and five subfields of computer science. The evaluation reveals a concerning pattern: LLMs frequently generate incorrect arXiv links or cite non-existent papers, posing a significant risk to academic credibility. Claude-3.5-Sonnet achieves relatively high accuracy, and most LLMs perform substantially better in the artificial intelligence subfield than in the other disciplines. By introducing arXivBench, this study contributes to evaluating and improving the reliability of LLMs in academic use. The code and dataset are publicly available.