Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Measuring Diversity in Synthetic Datasets

Created by
  • Haebom

Author

Yuchang Zhu, Huizhe Zhang, Bingzhe Wu, Jintang Li, Zibin Zheng, Peilin Zhao, Liang Chen, Yatao Bian

Outline

This paper proposes DCScore, a novel method for measuring the diversity of synthetic datasets generated with large language models (LLMs). To address the difficulty of quantifying diversity in such datasets, DCScore formulates diversity evaluation as a sample classification task that leverages the relationships among samples. Theoretical analysis shows that DCScore satisfies a set of diversity-related axioms. Experiments on synthetic datasets show that DCScore correlates more strongly with pseudo-truth diversity labels than existing methods while incurring lower computational cost. The code is available on GitHub.
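The idea of treating diversity evaluation as a sample classification task can be illustrated with a short sketch. The snippet below is a minimal illustration, not the authors' implementation: the embedding model name, the temperature parameter, and the softmax-over-similarities formulation are all assumptions. It embeds each sample, turns pairwise similarities into a row-wise softmax "classification" over all samples, and sums the probability that each sample is assigned to itself, so mutually distinct samples push the score toward n while near-duplicates pull it toward 1.

```python
# Illustrative sketch of a classification-style diversity score.
# Assumes sentence-transformers is installed; the model choice and the
# exact scoring formula are assumptions, not the paper's definition.
import numpy as np
from sentence_transformers import SentenceTransformer

def classification_diversity(texts, temperature=0.1):
    """Score n texts between ~1 (all identical) and n (all distinct)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    emb = model.encode(texts, normalize_embeddings=True)
    sim = emb @ emb.T                                 # pairwise cosine similarities
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)         # row-wise softmax: each sample
                                                      # "classified" among all samples
    # Summing the probability that each sample is assigned to itself rewards
    # datasets whose samples are mutually distinct.
    return float(np.trace(probs))

texts = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Quarterly revenue grew by 12%.",
]
print(classification_diversity(texts))
```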

Takeaways, Limitations

Takeaways:
  • Presents DCScore, a novel method for effectively and efficiently measuring the diversity of LLM-generated synthetic datasets.
  • Demonstrates improved diversity measurement performance and computational efficiency compared to existing methods.
  • Establishes the validity of DCScore on theoretical grounds by showing it satisfies diversity-related axioms.
  • Improves reproducibility and usability by releasing the code publicly.
Limitations:
  • The reported experimental results may be limited to the specific synthetic datasets evaluated.
  • Differing perspectives on how diversity should be defined and measured still need to be considered.
  • Further evaluation of DCScore's performance in real-world applications is needed.