Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

A Little Human Data Goes A Long Way

Created by
  • Haebom

Authors

Dhananjay Ashok, Jonathan May

Outline

This paper examines synthetic data generation as a way to reduce the cost of human annotation in natural language processing (NLP) systems. The authors analyze the effect of incrementally replacing human-generated training data with synthetic data on fact verification (FV) and question answering (QA) tasks across eight diverse datasets. Replacing up to 90% of the training data with synthetic data causes only minimal performance degradation, but replacing the final 10% causes a significant drop. Models trained purely on synthetic data improve when as few as 125 human-generated data points are added, whereas matching the gain from an additional 200 human-generated data points requires a significantly larger amount of synthetic data. These findings suggest that even when large-scale human annotation is not feasible, having humans generate a portion of the dataset can be valuable.
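
The replacement setup described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `mix_training_set`, the uniform random sampling, and the fixed seed are all assumptions made for illustration. It keeps the training set size constant and swaps a chosen fraction of human examples for synthetic ones.

```python
import random


def mix_training_set(human, synthetic, synthetic_fraction, seed=0):
    """Build a training set of len(human) examples in which a given
    fraction of the human examples is replaced by synthetic examples.

    Hypothetical helper for illustration; the paper's exact sampling
    procedure is not described in this summary.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")

    total = len(human)
    n_synthetic = round(total * synthetic_fraction)
    if n_synthetic > len(synthetic):
        raise ValueError("not enough synthetic examples for this fraction")

    rng = random.Random(seed)
    # Keep a random subset of the human data and fill the rest synthetically.
    kept_human = rng.sample(human, total - n_synthetic)
    added_synthetic = rng.sample(synthetic, n_synthetic)

    mixed = kept_human + added_synthetic
    rng.shuffle(mixed)
    return mixed


# Toy data: each example is tagged with its origin.
human_data = [("human", i) for i in range(1000)]
synthetic_data = [("synthetic", i) for i in range(1000)]

# 90% synthetic, 10% human, same total size as the original set.
train_90 = mix_training_set(human_data, synthetic_data, 0.9)
```

Sweeping `synthetic_fraction` from 0.0 to 1.0 and training one model per mixture is one way to reproduce the kind of curve the summary describes, where performance holds up to roughly 90% replacement and falls off once the last human examples are removed.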

Takeaways, Limitations

Takeaways:
Synthetic data can be a cost-effective alternative to human annotation.
Replacing up to 90% of the training data with synthetic data causes only minimal performance degradation.
Adding a small amount of human-annotated data can significantly improve the performance of models trained on synthetic data.
Comparing the cost of human annotation with the cost of synthetic data generation helps determine the most cost-effective data composition.
Limitations:
The results are limited to two tasks (FV, QA) and eight datasets, so they may not generalize to other NLP tasks or datasets.
Because the quality and diversity of synthetic data significantly impact performance, further research is needed on synthetic data generation methods.
Cost comparisons are based on assumptions about specific situations, so generalizations should be made with caution.
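
The cost comparison mentioned in the takeaways comes down to simple arithmetic, sketched below. The function names are hypothetical, and the figure of 2,000 synthetic points is an illustrative placeholder, not a result from the paper; only the 200 human data points come from the summary above. The idea: if a given performance gain needs `n_human` human points or `n_synthetic` synthetic points, human annotation is cheaper whenever the per-point price ratio (human / synthetic) is below `n_synthetic / n_human`.

```python
def break_even_price_ratio(n_human, n_synthetic):
    """Per-point price ratio (human / synthetic) below which human
    annotation is the cheaper way to reach the same performance gain."""
    if n_human <= 0 or n_synthetic <= 0:
        raise ValueError("point counts must be positive")
    return n_synthetic / n_human


def cheaper_option(cost_per_human, cost_per_synthetic, n_human, n_synthetic):
    """Return which data source reaches the gain at lower total cost."""
    human_total = cost_per_human * n_human
    synthetic_total = cost_per_synthetic * n_synthetic
    return "human" if human_total < synthetic_total else "synthetic"


# Assumption: 2000 synthetic points match the gain of 200 human points.
# The 2000 is a made-up placeholder, not a number reported by the authors.
ratio = break_even_price_ratio(n_human=200, n_synthetic=2000)

# With hypothetical prices of 0.50 per human point and 0.10 per synthetic
# point, the price ratio is 5, which is below the break-even ratio.
choice = cheaper_option(0.50, 0.10, n_human=200, n_synthetic=2000)
```

Because the break-even ratio depends entirely on how many synthetic points are needed to match a given amount of human data, it has to be re-estimated for each task and dataset, which is why the limitation above urges caution in generalizing cost conclusions.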