Daily Arxiv

This page curates papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

What Has Been Lost with Synthetic Evaluation?

Created by
  • Haebom

Author

Alexander Gill, Abhilasha Ravichander, Ana Marasović

Outline

As large language models (LLMs) are increasingly used for data generation, using them to build evaluation benchmarks has become more important. This paper examines, through two case studies, whether LLMs can meet the requirements for creating reasoning-focused text benchmarks. Specifically, the authors have LLMs generate versions of two high-quality crowdsourced reading-comprehension datasets, CondaQA (reasoning about negation) and DROP (quantitative reasoning), and compare them to the original crowdsourced data. They find that LLMs can follow the original annotation guidelines and produce valid benchmarks at much lower cost, but the resulting benchmarks are less challenging than the human-authored ones.
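The paper's setup can be pictured as two steps: prompting an LLM with the original dataset's annotation guidelines to write new items, and then comparing how well models score on the synthetic items versus the crowdsourced ones. The sketch below is a minimal illustration of that idea, not the authors' code: the prompt text, the `gpt-4o` model name, the JSON schema, and the exact-match scoring are all illustrative assumptions, and it presumes an OpenAI-style chat API.

```python
# Minimal sketch (assumptions, not the paper's implementation): generate a
# CondaQA-style negation item with an LLM, then compare accuracy on synthetic
# vs. human-authored items to gauge relative benchmark difficulty.
import json
from openai import OpenAI  # assumes an OpenAI-style chat API is available

client = OpenAI()

GUIDELINES = (
    "You are writing a reading-comprehension item about negation, following "
    "the CondaQA annotation guidelines: given a passage containing a negated "
    "statement, write a question whose answer depends on understanding the "
    "implications of that negation, along with the correct answer."
)

def generate_item(passage: str, model: str = "gpt-4o") -> dict:
    """Ask the LLM for one synthetic benchmark item as JSON."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": GUIDELINES},
            {"role": "user", "content": f"Passage:\n{passage}\n\n"
             'Respond as JSON: {"question": "...", "answer": "..."}'},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Exact-match accuracy; used here as a proxy for benchmark difficulty."""
    return sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(predictions, gold)) / len(gold)

# A synthetic benchmark counts as "less challenging" if evaluated models score
# higher on the LLM-generated split than on the human-authored split.
```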

Takeaways, Limitations

Data generation with LLMs enables the creation of cost-effective, valid benchmarks.
LLM-generated benchmarks are less challenging than human-authored ones.
Replacing crowdsourcing with LLM-based benchmark creation therefore risks lowering benchmark difficulty.
Careful consideration is needed when deciding how to use LLMs for benchmark generation.