Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

Created by
  • Haebom

Authors

Ingo Ziegler, Abdullatif Köksal, Desmond Elliott, Hinrich Schütze

Outline

To address the challenges of building high-quality datasets for specialized tasks, this paper proposes Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method that generates synthetic datasets from a small number of user-written few-shot examples. CRAFT uses a large-scale public web-crawled corpus and similarity-based document retrieval to find relevant documents, then uses an instruction-tuned large language model (LLM) to augment the retrieved documents into task samples in the user-defined format. Experiments on four diverse tasks, namely biology, medicine, and commonsense question answering (QA) as well as summarization, show that CRAFT efficiently generates large, task-specific training datasets. On the QA tasks, CRAFT-based models match or outperform general-purpose LLMs; on summarization, they outperform models trained on human-curated data by 46 preference points. CRAFT also outperforms other synthetic dataset generation methods, such as Self-Instruct and Evol-Instruct, and remains robust when the quality of the initial few shots varies.
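The retrieve-then-augment flow described above can be illustrated with a small sketch. This is not the authors' implementation: the function names (embed, retrieve, build_augmentation_prompt) are hypothetical, a bag-of-words cosine similarity stands in for whatever embedding model a real system would use, and the prompt wording is an assumption. The LLM call itself is left out; the sketch only shows how few-shot examples could select documents and how a prompt for the augmentation step might be assembled.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(few_shots, corpus, top_k=2):
    """Rank corpus documents by their best similarity to any few-shot example."""
    shot_vecs = [embed(shot) for shot in few_shots]
    scored = [(max(cosine(embed(doc), v) for v in shot_vecs), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_augmentation_prompt(few_shots, document):
    """Assemble a (hypothetical) prompt asking an LLM to turn a document into a task sample."""
    examples = "\n\n".join(f"Example {i + 1}:\n{s}" for i, s in enumerate(few_shots))
    return (
        "Rewrite the document below as one new task sample in the same format "
        "as the examples.\n\n"
        f"{examples}\n\nDocument:\n{document}\n\nNew sample:"
    )


# Usage: one biology QA few-shot example and a three-document toy corpus.
few_shots = ["Question: Which organelle produces ATP in a cell? Answer: The mitochondrion."]
corpus = [
    "Mitochondria produce most of the ATP that a cell uses for energy.",
    "The stock market closed higher on Friday after a volatile week.",
    "Ribosomes assemble proteins from amino acids in the cell.",
]
docs = retrieve(few_shots, corpus, top_k=2)
prompts = [build_augmentation_prompt(few_shots, doc) for doc in docs]
```

In this toy run the two biology documents rank above the unrelated finance document, and each retrieved document yields one prompt that an instruction-tuned LLM could complete. A realistic setup would replace embed with dense embeddings and an index built over a web-scale corpus.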

Takeaways, Limitations

Takeaways:
CRAFT efficiently generates large-scale, high-quality training datasets from only a few user-written examples.
It applies across several domains and tasks (biology, medicine, commonsense QA, and summarization).
It outperforms existing synthetic dataset generation methods such as Self-Instruct and Evol-Instruct, and stays robust when the quality of the few shots varies.
Task-specific datasets can be built even without specialized knowledge.
Limitations:
CRAFT may depend on the LLM used for augmentation, so that model's performance limitations may carry over to the generated data.
The quality of the initial few shots can affect the outcome, so quality control of those examples is important.
The quality and bias of the web-crawled data can affect results, so data reliability and bias need to be addressed.
Task-specific optimization may be required, and generalization across a wider variety of tasks needs improvement.