Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset

Created by
  • Haebom

Authors

Jeongkyun Park, Jung-Wook Hwang, Kwanghee Choi, Seung-Hyun Lee, Jun Hwan Ahn, Rae-Hong Park, Hyung-Min Park

Outline

The Open Large-scale Korean Audio-Visual Speech (OLKAVS) dataset is the largest publicly available audio-visual speech dataset, comprising 1,150 hours of video from 1,107 Korean speakers. It was recorded in a studio environment, covering nine different viewpoints and various noise conditions. The release also provides pre-trained baseline models for two tasks, audio-visual speech recognition and lip reading, along with experimental results validating the effectiveness of multimodal and multi-view learning. The dataset is expected to overcome the limitations of existing English-centric datasets and to facilitate multimodal research in areas such as Korean speech recognition, speaker recognition, pronunciation-level classification, and lip movement analysis.

Takeaways, Limitations

Takeaways:
  • Provides a large-scale Korean audio-visual dataset, facilitating multimodal research in Korean.
  • Covers nine viewpoints and various noise conditions, supporting the development of models robust to real-world environments.
  • Lowers the barrier to entry by providing pre-trained baseline models.
  • Suggests research directions by verifying the effectiveness of multimodal and multi-view learning.
Limitations:
  • Although the dataset is large, specific details on its diversity (speaker characteristics, utterance content, etc.) are lacking.
  • The dependence on predictive models during dataset construction is not clearly addressed (further research required).