Overcoming the data shortage problem by generating synthetic data

Synthetic data generation is a technique in which an AI model creates artificial data when real data is insufficient. Starting from data that actually exists, the model 'synthesizes' new examples, which can then be used as additional training data.
For example, suppose you are building a chatbot that needs to understand the legal documents of country A, but too few of those documents are available. Synthetic data generation can produce new legal documents derived from existing legal documents of other countries. Although these documents do not actually exist, they can help the chatbot learn what it needs in order to understand legal documents.
Synthetic data like this saves time and money when training AI models, and helps prepare for a variety of situations. It also gives you the flexibility to tailor your data to specific domains or languages.
Retrieval-Augmented Generation (RAG) is an approach in which a system first retrieves relevant information and then generates an answer based on it. Synthetic data generation can be used to build the database that a RAG system searches. With synthetic data, a RAG system can also learn to answer a wider variety of questions and provide more accurate and useful answers to real user questions.
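As a rough illustration of the idea above, the following is a minimal sketch of building a small searchable set of synthetic question/passage pairs. Everything in it is an assumption made for illustration: `SyntheticPair`, `generate_question`, `build_synthetic_pairs`, and `retrieve` are hypothetical names, the template-based generator merely stands in for a call to a language model, and simple word overlap stands in for a real retriever.

```python
from dataclasses import dataclass


@dataclass
class SyntheticPair:
    question: str  # synthetic question generated for the passage
    passage: str   # source passage that answers the question


def generate_question(passage: str) -> str:
    """Hypothetical stand-in for an LLM call: wrap the first sentence in a template."""
    first_sentence = passage.split(".")[0].strip()
    return f"What does the following rule state: {first_sentence}?"


def build_synthetic_pairs(passages: list[str]) -> list[SyntheticPair]:
    """Create one synthetic question per passage."""
    return [SyntheticPair(generate_question(p), p) for p in passages]


def tokenize(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation removed."""
    words = (w.strip(".,:?!").lower() for w in text.split())
    return {w for w in words if w}


def retrieve(query: str, pairs: list[SyntheticPair]) -> SyntheticPair:
    """Return the pair sharing the most words with the query (toy retriever)."""
    q = tokenize(query)
    return max(pairs, key=lambda p: len(q & tokenize(p.question + " " + p.passage)))


passages = [
    "Tenants must receive 30 days notice before eviction. Notice must be written.",
    "Contracts signed by minors are voidable. A guardian may ratify them.",
]
pairs = build_synthetic_pairs(passages)
best = retrieve("How much notice before eviction?", pairs)
print(best.passage)
```

In a real system the template would be replaced by a generative model and the overlap score by an embedding-based or keyword-based search index; the structure of "generate pairs, then search them" is the only part this sketch is meant to show.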
However, ensuring the quality of synthetic data is an important challenge. If the data does not reflect real situations well, the model may learn incorrect information. Synthetic data must therefore be both diverse and high in quality, and it should be evaluated periodically to confirm that the model gives answers appropriate to real situations. In practice, this means checking that the data covers the range of scenarios relevant to real-world tasks and continuing to improve the model based on its measured performance.
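The kind of periodic check described above can start very simply. The sketch below computes two crude signals for a batch of synthetic questions, a duplicate rate and a distinct-token ratio, and filters out duplicates and very short samples. The function names and the `min_words` threshold are assumptions chosen for illustration, and these metrics are only rough proxies for diversity and quality, not a substitute for evaluating the model on real tasks.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical strings compare equal."""
    return " ".join(text.lower().split())


def duplicate_rate(samples: list[str]) -> float:
    """Fraction of samples that repeat an earlier sample after normalization."""
    if not samples:
        return 0.0
    unique = {normalize(s) for s in samples}
    return 1 - len(unique) / len(samples)


def distinct_ratio(samples: list[str]) -> float:
    """Unique tokens divided by total tokens; low values suggest repetitive data."""
    tokens = [t for s in samples for t in normalize(s).split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def filter_samples(samples: list[str], min_words: int = 4) -> list[str]:
    """Drop duplicates and samples shorter than min_words (assumed threshold)."""
    seen = set()
    kept = []
    for s in samples:
        key = normalize(s)
        if key in seen or len(key.split()) < min_words:
            continue
        seen.add(key)
        kept.append(s)
    return kept


samples = [
    "What is the notice period for eviction?",
    "what is the notice period for  eviction?",
    "Who can ratify a contract signed by a minor?",
    "Why?",
]
print(duplicate_rate(samples))
print(distinct_ratio(samples))
print(filter_samples(samples))
```

Tracking numbers like these over successive batches gives an early warning when generated data becomes repetitive, before that shows up as degraded answers.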
Real use cases?
Dai et al. (2022): This study proposed a method that achieves near-state-of-the-art performance using only eight manually labeled examples and a large amount of unlabeled data (e.g., all parsed legal documents).
Use of synthetic data in machine learning models: In certain situations, machine learning models trained on synthetic data can outperform models trained on real data. Knowing when this holds can help scientists decide when synthetic data is the better choice for training, which in turn can avoid some of the bias, privacy, security, and copyright issues that affect real-world datasets.
Synthetic data development at MOSTLY AI: MOSTLY AI describes itself as a leader in synthetic data generation for AI model development and software testing, and presents this as a sign of rapid progress in AI and synthetic data. This is only the company's own claim, so it is questionable whether it actually works as advertised. It reads like promotional material, but this is what they say they are doing.
In certain industries, such as finance or healthcare, there may be legal or ethical barriers to obtaining real-world data. These are therefore fields with a strong need to generate the necessary training data from existing data that was collected with consent.
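To make the few-shot idea mentioned for Dai et al. (2022) more concrete, here is a minimal sketch of how a handful of labeled examples could be turned into a prompt that asks a language model to write a query for a new, unlabeled document. This is not the prompt format used in that study: the `Document:` / `Query:` layout, the function name `build_few_shot_prompt`, and the example texts are all assumptions for illustration, and the call to an actual language model is deliberately left out.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], new_document: str) -> str:
    """Assemble labeled (document, query) pairs plus one unlabeled document.

    The prompt ends with an open 'Query:' line for the model to complete.
    """
    parts = [f"Document: {doc}\nQuery: {query}" for doc, query in examples]
    parts.append(f"Document: {new_document}\nQuery:")
    return "\n\n".join(parts)


# Two hand-written examples for brevity; the study described above used eight.
labeled_examples = [
    ("Tenants must receive 30 days notice before eviction.",
     "How much notice is required before eviction?"),
    ("Contracts signed by minors are voidable.",
     "Is a contract signed by a minor binding?"),
]
prompt = build_few_shot_prompt(
    labeled_examples,
    "Employers must pay overtime for work beyond 40 hours per week.",
)
print(prompt)
```

Running such a prompt over every unlabeled document would yield synthetic (query, document) pairs, which could then be filtered and used as training data for a retriever.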
In conclusion, synthetic data generation is a great help in quickly developing and testing AI models when data is scarce, and it is especially valuable for approaches like RAG, which generate better answers based on retrieved information. Services such as GPTs, Bing, Google's Bard, and Notion Q&A are representative examples: they generate better answers based on documents and files that users upload or write.
