Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

ArXivBench: When You Should Avoid Using ChatGPT for Academic Writing

Created by
  • Haebom

Author

Ning Li, Jingran Zhang, Justin Cui

Outline

This paper evaluates the factual accuracy of large language models (LLMs), focusing on how reliably they generate links to arXiv papers. Using a novel benchmark, arXivBench, which covers eight major disciplines and five subfields of computer science, the authors evaluate a range of proprietary and open-source LLMs. The evaluation reveals that LLMs pose a significant risk to academic credibility, frequently generating incorrect arXiv links or citing non-existent papers. Claude-3.5-Sonnet achieved relatively high accuracy, and most LLMs performed significantly better on artificial intelligence topics than on other disciplines. Through the arXivBench benchmark, this study contributes to evaluating and improving the reliability of LLMs in academic use. The code and dataset are publicly available.

Takeaways, Limitations

Takeaways:
It demonstrates the severity of the factual-accuracy problem in LLMs, especially in academic contexts.
We identify field-specific variation in LLM performance and suggest directions for future LLM development and use.
We provide a new benchmark, arXivBench, enabling objective evaluation of LLMs for academic use.
We emphasize the importance of research to ensure the reliability of LLMs in academic settings.
Limitations:
The current benchmark is limited to arXiv papers and does not evaluate LLM performance on other types of academic materials.
The set of LLM types and versions assessed may be limited.
The metrics used to evaluate LLM performance may have limitations and room for improvement.