This paper proposes DCScore, a novel method for measuring the diversity of synthetic datasets generated by large language models (LLMs). To address the difficulty of quantifying diversity in such datasets, DCScore formalizes diversity evaluation as a sample classification task that leverages pairwise relationships among samples. Theoretical analysis shows that DCScore satisfies diversity-related axioms, and experiments on synthetic datasets show that it correlates more strongly with diversity pseudo-truths than existing methods while incurring lower computational cost. The code is available on GitHub.
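To make the classification view concrete, here is a minimal sketch, not the authors' released implementation: it assumes sample embeddings are given, uses a cosine-similarity kernel with a softmax temperature tau, and the name dcscore_sketch is ours. Each sample is treated as its own class; pairwise similarities are softmax-normalized row-wise, and the score sums each sample's probability of being classified as itself, so near-duplicates split probability mass and pull the score toward 1, while distinct samples push it toward the dataset size.

import numpy as np

def dcscore_sketch(embeddings: np.ndarray, tau: float = 1.0) -> float:
    """Illustrative classification-based diversity score (assumed setup)."""
    # Normalize rows so the kernel below is cosine similarity.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T                                 # (n, n) pairwise similarities
    logits = sims / tau
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)      # row-wise softmax over classes
    # Sum each sample's probability of being classified as itself.
    return float(np.trace(probs))

rng = np.random.default_rng(0)
distinct = rng.normal(size=(8, 32))                # 8 unrelated samples
near_dupes = np.repeat(distinct[:1], 8, axis=0)    # 8 copies of one sample
near_dupes = near_dupes + 0.01 * rng.normal(size=near_dupes.shape)
print(dcscore_sketch(distinct))                    # noticeably above 1
print(dcscore_sketch(near_dupes))                  # close to 1: duplicates share mass

The kernel, temperature, and embedding source are free choices in this sketch; the paper's actual formulation should be consulted for the exact instantiation.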
Takeaways, Limitations
• Takeaways:
  ◦ Presents DCScore, an effective and efficient method for measuring the diversity of LLM-generated synthetic datasets.
  ◦ Demonstrates better diversity-measurement performance and lower computational cost than existing methods.
  ◦ Establishes the validity of DCScore on theoretical grounds via diversity-related axioms.
  ◦ Improves reproducibility and usability through publicly released code.
• Limitations:
  ◦ The reported experimental results may not generalize beyond the specific synthetic datasets evaluated.
  ◦ Other perspectives on how diversity should be defined and measured remain to be considered.
  ◦ Further evaluation of DCScore's performance in real-world applications is needed.