This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
Created by
Haebom
Authors
Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli
Outline
This paper presents Source2Synth, a method that uses synthetic data generation to improve the performance of large language models (LLMs) without costly manual annotation. Source2Synth takes a real-world data source as input and produces synthetic examples grounded in it, each including intermediate reasoning steps. It then raises dataset quality by discarding low-quality generations based on their answerability. The authors demonstrate the approach on two tasks with different data types: multi-hop question answering (MHQA), which tests complex reasoning over documents, and tabular question answering (TQA), which tests tool use over tables. The method improves performance by 25.51% for TQA on WikiSQL and by 22.57% for MHQA on HotpotQA, compared to the fine-tuned baselines.
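The answerability-based curation step can be pictured with a minimal sketch. This summary does not spell out the paper's exact criterion, so the code below assumes one plausible reading: an example is kept only if an answering model, given the source and the question, reproduces the generated answer within a few attempts. The names `SyntheticExample`, `is_answerable`, `curate`, and `toy_answerer`, the exact-match comparison, and the `attempts` parameter are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SyntheticExample:
    source: str     # real-world passage or table the example is grounded in
    question: str   # generated question
    reasoning: str  # generated intermediate reasoning steps
    answer: str     # generated reference answer


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so the comparison is lenient."""
    return " ".join(text.lower().split())


def is_answerable(
    example: SyntheticExample,
    answer_fn: Callable[[str, str], str],
    attempts: int = 3,
) -> bool:
    """Return True if answer_fn reproduces the reference answer within `attempts` tries.

    Assumption: exact match after normalization stands in for whatever
    correctness check a real pipeline would use.
    """
    target = normalize(example.answer)
    for _ in range(attempts):
        predicted = answer_fn(example.source, example.question)
        if normalize(predicted) == target:
            return True
    return False


def curate(
    examples: List[SyntheticExample],
    answer_fn: Callable[[str, str], str],
    attempts: int = 3,
) -> List[SyntheticExample]:
    """Keep only the examples judged answerable from their own source."""
    return [ex for ex in examples if is_answerable(ex, answer_fn, attempts)]


def toy_answerer(source: str, question: str) -> str:
    """Stand-in for an LLM call: returns the text after the last ' is ' in the source."""
    return source.rsplit(" is ", 1)[-1].rstrip(".")


examples = [
    SyntheticExample(
        source="The capital of France is Paris.",
        question="What is the capital of France?",
        reasoning="The passage states the capital directly.",
        answer="Paris",
    ),
    SyntheticExample(
        source="The capital of France is Paris.",
        question="What is the capital of Spain?",
        reasoning="The passage does not mention Spain.",
        answer="Madrid",
    ),
]

kept = curate(examples, toy_answerer)
print([ex.question for ex in kept])
```

In this toy run only the first example survives, because the second one's answer cannot be recovered from its source. In a real pipeline `answer_fn` would wrap a language model call, and the match criterion would likely need to be more forgiving than exact string equality.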
Takeaways, Limitations
• Takeaways:
  ◦ Presents a method for improving LLM performance by generating synthetic data grounded in real data sources, without manual annotation.
  ◦ Improves data quality by including intermediate reasoning steps and discarding low-quality generations.
  ◦ Demonstrates applicability to different data types and tasks (MHQA on documents, TQA on tables).
  ◦ Achieves gains of 25.51% on WikiSQL (TQA) and 22.57% on HotpotQA (MHQA) over the baselines.
• Limitations:
  ◦ The scalability of Source2Synth needs further experimentation and analysis.
  ◦ Generalization to other data types and tasks has not been established and may need improvement.
  ◦ The criteria for discarding low-quality generations need further research to make them more objective and better tuned.
  ◦ The effect of bias in the underlying data sources on the results still needs to be analyzed.