Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Dynaword: From One-shot to Continuously Developed Datasets

Created by
  • Haebom

Authors

Kenneth Enevoldsen, Kristian Nørgaard Jensen, Jan Kostkan, Balázs Szabó, Márton Kardos, Kirten Vad, Johan Heinsen, Andrea Blasi Núñez, Gianluca Barmina, Jacob Nielsen, Rasmus Larsen, Peter Vahlstrup, Per Møldrup Dalum, Desmond Elliott, Lukas Galke, Peter Schneider-Kamp, Kristoffer Nielbo

Outline

This paper presents Dynaword, an approach to developing large-scale datasets, and Danish Dynaword, its concrete implementation. Together they address three key challenges in building and using large-scale datasets for natural language processing: 1) ambiguous licensing restricts use, sharing, and derivative works; 2) static dataset releases hinder ongoing community contributions and long-term maintenance; and 3) quality assurance is confined to the publishing team. Dynaword is a framework for creating large-scale, open datasets that can be continuously updated through community collaboration, and Danish Dynaword validates this approach and demonstrates its potential. Danish Dynaword contains more than four times as many tokens as comparable existing datasets, is entirely openly licensed, and has received diverse contributions from industry and research. It also includes lightweight tests that check data format, quality, and documentation, establishing a sustainable framework for ongoing community contributions and dataset evolution.
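To make the idea of lightweight tests for format, quality, and documentation concrete, here is a minimal Python sketch. It is not the paper's test suite: the field names, the license allow-list, and the functions check_record and check_dataset are hypothetical and chosen only for illustration.

```python
# Hypothetical allow-list of open licenses; not taken from the paper.
ALLOWED_LICENSES = {"cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0"}
# Hypothetical fields every record is assumed to carry.
REQUIRED_FIELDS = ("id", "text", "source", "license")


def check_record(record: dict) -> list[str]:
    """Return a list of human-readable problems found in one record."""
    problems = []
    # Format check: every required field must be present.
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
    # Quality check: text must be a non-empty string.
    if "text" in record:
        text = record["text"]
        if not isinstance(text, str):
            problems.append("text is not a string")
        elif not text.strip():
            problems.append("empty text")
    # Licensing check: the license must be on the allow-list.
    license_id = record.get("license")
    if license_id is not None and str(license_id).lower() not in ALLOWED_LICENSES:
        problems.append(f"license not on allow-list: {license_id}")
    return problems


def check_dataset(records: list[dict], datasheet: str) -> dict[str, list[str]]:
    """Run all checks; map each record id (or index) to its problems."""
    report: dict[str, list[str]] = {}
    seen_ids = set()
    for index, record in enumerate(records):
        problems = check_record(record)
        record_id = str(record.get("id", f"#{index}"))
        if record_id in seen_ids:
            problems.append("duplicate id")
        seen_ids.add(record_id)
        if problems:
            report.setdefault(record_id, []).extend(problems)
    # Documentation check: the dataset description must not be empty.
    if not datasheet.strip():
        report.setdefault("<datasheet>", []).append("documentation is empty")
    return report
```

Checks of this kind could run automatically on each proposed contribution, so that an empty report is a precondition for merging. Which checks a real project enforces is a design decision for its maintainers.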

Takeaways, Limitations

Takeaways:
Presents a framework for creating large-scale, open datasets that are continuously updated through community contributions.
Validates the feasibility and utility of the Dynaword approach through Danish Dynaword.
Provides an openly licensed dataset with more than four times as many tokens as comparable existing datasets.
Builds a lightweight testing and documentation system to support data quality and sustainability.
Limitations:
The scalability of the Dynaword approach and its applicability to other languages and domains require further research.
Effective governance and engagement mechanisms for community contributions need further consideration.
It remains to be verified whether the characteristics of Danish Dynaword carry over to datasets built for other languages and domains.