Daily Arxiv

This page curates AI-related papers published around the world.
All summaries here are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

Created by
  • Haebom

Authors

Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli

Outline

This paper presents Source2Synth, a method that uses synthetic data generation to improve the performance of large language models (LLMs) without expensive manual annotation. Source2Synth generates synthetic data points grounded in real-world data sources and includes intermediate reasoning steps in each one. It then improves dataset quality by discarding low-quality generations based on their answerability. The authors demonstrate the approach on two tasks with different data types: multi-hop question answering (MHQA), which tests complex reasoning over documents, and tabular question answering (TQA), which tests tool use over tables. The method achieves performance gains of 25.51% on the WikiSQL TQA task and 22.57% on the HotpotQA MHQA task compared to the baselines.
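The answerability-based curation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `SyntheticExample` fields, the `attempts` parameter, the exact-match comparison, and the `toy_answer_fn` stand-in for a model are all assumptions made for the example. The idea shown is only that a synthetic example is kept when an answerer, given the source and the question, can reproduce the synthetic answer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SyntheticExample:
    # All field names are hypothetical, chosen for this sketch.
    source: str     # real-world text the example is grounded in
    question: str   # synthetic question
    reasoning: str  # intermediate reasoning steps
    answer: str     # synthetic answer


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons are less brittle."""
    return " ".join(text.lower().split())


def is_answerable(
    example: SyntheticExample,
    answer_fn: Callable[[str, str], str],
    attempts: int = 3,
) -> bool:
    """Return True if any attempt reproduces the synthetic answer.

    `answer_fn(source, question)` stands in for a model call (assumption).
    """
    target = normalize(example.answer)
    for _ in range(attempts):
        if normalize(answer_fn(example.source, example.question)) == target:
            return True
    return False


def curate(
    examples: List[SyntheticExample],
    answer_fn: Callable[[str, str], str],
    attempts: int = 3,
) -> List[SyntheticExample]:
    """Keep only the examples judged answerable from their source."""
    return [ex for ex in examples if is_answerable(ex, answer_fn, attempts)]


def toy_answer_fn(source: str, question: str) -> str:
    """Toy answerer: reads 'key: value' lines and returns the value whose
    key appears in the question. A real setup would query a model here."""
    for line in source.splitlines():
        key, sep, value = line.partition(":")
        if sep and normalize(key) in normalize(question):
            return value.strip()
    return ""


if __name__ == "__main__":
    src = "capital of France: Paris"
    data = [
        SyntheticExample(src, "What is the capital of France?",
                         "Look up France in the source.", "Paris"),
        SyntheticExample(src, "What is the capital of Spain?",
                         "Spain is not in the source.", "Madrid"),
    ]
    kept = curate(data, toy_answer_fn)
    print([ex.question for ex in kept])
```

In this sketch the second example is dropped because its answer cannot be recovered from the source it claims to be grounded in. Exact string matching is a deliberately simple criterion; a real pipeline would likely need a more tolerant comparison.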

Takeaways, Limitations

Takeaways:
Presents a method for improving LLM performance by generating synthetic data grounded in real data sources.
Improves data quality by including intermediate reasoning steps and discarding low-quality generations.
Demonstrates applicability to different data types and tasks (MHQA on documents, TQA on tables).
Achieves performance gains of 25.51% on WikiSQL and 22.57% on HotpotQA.
Limitations:
The scalability of Source2Synth needs further experimentation and analysis.
Limits on generalization to other data types and tasks need to be identified and addressed.
The criteria for discarding low-quality generations need further research to make them more objective and better tuned.
The impact of bias in the underlying data sources on the results needs to be analyzed.