Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All summaries on this page are generated with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation

Written by
  • Haebom

Author

Jiajun He, Naoki Sawada, Koichi Miyazaki, Tomoki Toda

Outline

This paper proposes PARCO (Phoneme-Augmented Robust Contextual ASR via COntrastive Entity Disambiguation) to address a weakness of automatic speech recognition (ASR) systems: they struggle with domain-specific named entities, particularly homophones. PARCO combines phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering to improve phoneme discrimination, ensure complete entity retrieval, and reduce false positives under uncertainty. With 1,000 distractor entities, it achieves a character error rate (CER) of 4.22% on the Chinese AISHELL-1 dataset and a word error rate (WER) of 11.14% on the English DATA2 dataset, significantly outperforming baseline methods. It also shows robust gains on out-of-domain datasets such as THCHS-30 and LibriSpeech.
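The results above are reported as CER (Chinese) and WER (English). Both use the standard definition (S + D + I) / N: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference units (characters for CER, words for WER). The minimal sketch below computes them with a Levenshtein distance. It is not the paper's evaluation script, and ignoring whitespace in CER is an assumed convention that is common for Chinese text.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost for S, D, I)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of r
                curr[j - 1] + 1,         # insertion of h
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            )
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """(S + D + I) / N, where N is the number of reference units."""
    if not ref_units:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def _chars(text):
    # Assumption: whitespace is ignored when scoring characters.
    return [c for c in text if not c.isspace()]


def cer(ref, hyp):
    return error_rate(_chars(ref), _chars(hyp))


def wer(ref, hyp):
    # Words are taken as whitespace-separated tokens.
    return error_rate(ref.split(), hyp.split())


if __name__ == "__main__":
    # Two similar-sounding characters substituted: 北京 (Beijing) vs 背景 (background).
    print(cer("北京欢迎你", "背景欢迎你"))    # 0.4
    print(wer("the cat sat", "the bat sat"))  # 0.3333333333333333
```

The Chinese example shows why sound-alike entities matter: two near-homophone substitutions already cost 2 of 5 characters, so a 4.22% CER corresponds to roughly 4 edit errors per 100 reference characters.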

Takeaways, Limitations

Takeaways:
We present a novel contextual ASR model that uses phoneme-level information to reduce recognition errors on homophonous named entities.
We improved the accuracy and stability of entity recognition through contrastive learning and hierarchical filtering.
We show that our method outperforms existing methods on multiple Chinese and English datasets.
Limitations:
There is a lack of analysis of the computational complexity and resource consumption of the proposed model.
Further research is needed on generalization performance across different languages and domains.
Further evaluation of robustness in real-world environments is needed.
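The summary does not give PARCO's exact contrastive objective. As a generic illustration of contrastive entity disambiguation, the minimal sketch below assumes an InfoNCE-style loss: a query embedding (hypothetically produced by a phoneme-aware encoder) is scored against the target entity and its homophone distractors, and the loss is low only when the target scores highest. The function names, the cosine similarity, the temperature of 0.1, and the toy 3-d embeddings are all assumptions for illustration, written in plain Python rather than the batched tensor code a real model would use.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def info_nce(query, candidates, positive_index, temperature=0.1):
    """InfoNCE loss: -log softmax(sim / tau)[positive] over candidate entities."""
    logits = [cosine(query, c) / temperature for c in candidates]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[positive_index]


def pick_entity(query, candidates):
    """At inference, choose the index of the candidate most similar to the query."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))


if __name__ == "__main__":
    # Hypothetical embeddings: the target entity and a homophone distractor.
    query = [0.9, 0.1, 0.0]
    target = [1.0, 0.0, 0.0]
    homophone = [0.6, 0.8, 0.0]
    cands = [target, homophone]
    print(info_nce(query, cands, positive_index=0))  # small loss
    print(info_nce(query, cands, positive_index=1))  # much larger loss
    print(pick_entity(query, cands))                 # 0
```

If all candidates are indistinguishable, the loss equals log N for N candidates, so training pushes the encoder to separate the target from sound-alike distractors; a lower temperature sharpens that pressure.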