Daily Arxiv

This page curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation

Created by
  • Haebom

Authors

Haris Riaz, Sourav Bhabesh, Vinayak Aranil, Miguel Ballesteros, Graham Horwood

Outline

In this paper, we propose MetaSynth, a synthetic data generation method for adapting large language models (LLMs) to specific domains. MetaSynth uses meta-prompting to coordinate multiple expert LLM agents that collaborate to generate diverse synthetic data. Using 25 million tokens of MetaSynth synthetic data, we adapt Mistral-7B-v0.3 to the financial and biomedical domains, improving domain-specific performance without degrading performance on general tasks. Using seven automatic evaluation metrics, we verify that the diversity of MetaSynth’s synthetic data approaches that of LLM pre-training corpora. We show that MetaSynth outperforms existing template prompting methods and that effective domain adaptation is possible with a small amount of diverse synthetic data.
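The collaboration described above can be pictured as a loop in which a meta agent repeatedly decides which expert agent should work on the draft next. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`generate_document`, `scripted_llm`), the prompt wording, the `DONE` stop signal, and the use of recent outputs as a "do not repeat" hint are all hypothetical, and each LLM is modeled as a plain function from prompt to text.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns text.
LLM = Callable[[str], str]


def generate_document(
    meta_llm: LLM,
    experts: Dict[str, LLM],
    topic: str,
    history: List[str],
    max_rounds: int = 3,
) -> str:
    """Produce one synthetic document by letting a meta agent route work to experts."""
    draft = ""
    for _ in range(max_rounds):
        # Assumption: the meta agent answers with an expert name or "DONE".
        routing_prompt = (
            f"Topic: {topic}\n"
            f"Current draft: {draft}\n"
            f"Available experts: {', '.join(experts)}\n"
            "Reply with one expert name, or DONE if the draft is finished."
        )
        choice = meta_llm(routing_prompt).strip()
        if choice == "DONE" or choice not in experts:
            break
        # Assumption: showing recent outputs nudges the expert away from repetition.
        expert_prompt = (
            f"Topic: {topic}\n"
            f"Current draft: {draft}\n"
            f"Do not repeat these earlier documents: {history[-3:]}"
        )
        draft = experts[choice](expert_prompt)
    history.append(draft)
    return draft


def scripted_llm(replies: List[str]) -> LLM:
    """Test double that returns canned replies in order, ignoring the prompt."""
    remaining = iter(replies)
    return lambda prompt: next(remaining)
```

In practice `meta_llm` and each entry of `experts` would wrap real model calls; `scripted_llm` exists only so the control flow can be exercised without a model.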

Takeaways, Limitations

Takeaways:
We demonstrate that MetaSynth, a synthetic data generation method based on meta-prompting, is effective for domain adaptation of LLMs.
We show that the domain-specific performance of LLMs can be improved even with a small amount of diverse synthetic data.
MetaSynth outperforms traditional template prompting methods.
We emphasize that the diversity of synthetic data is an important factor in improving LLM performance.
Limitations:
The effectiveness of MetaSynth has been demonstrated only for a specific LLM (Mistral-7B-v0.3) and specific domains (finance, biomedicine); further research on generalizability is required.
Diversity assessment methods beyond the automatic evaluation metrics used may be needed.
Further analysis of the computational cost and efficiency of MetaSynth is needed.
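To make the notion of measuring diversity concrete, the sketch below computes a distinct-n ratio: the fraction of word n-grams in a set of texts that are unique. This is an illustrative assumption only. The summary does not name the seven metrics used in the paper, and distinct-n is shown here simply as one common lexical diversity measure; the function name `distinct_n` and the whitespace tokenization are hypothetical choices.

```python
from typing import List, Set, Tuple


def distinct_n(texts: List[str], n: int = 2) -> float:
    """Return unique n-grams divided by total n-grams across all texts (0.0 if none)."""
    total = 0
    unique: Set[Tuple[str, ...]] = set()
    for text in texts:
        # Assumption: lowercase whitespace tokenization is good enough for a sketch.
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i : i + n]))
            total += 1
    return len(unique) / total if total else 0.0


templated = ["the bank raised rates", "the bank raised rates"]
varied = ["the bank raised rates", "a trial tested the drug"]
print(distinct_n(templated), distinct_n(varied))
```

A value near 1.0 means almost no n-gram is repeated across the texts, while repeated template-style outputs push the value down.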