Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

Created by
  • Haebom

Authors

Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland

Outline

This study presents an empirical investigation of applying a diffusion-based large language model (DLLM), LLaDA, to automatic speech recognition (ASR). Using LLaDA as an external deliberation processing module over Whisper-LLaMA transcripts, the authors explore various masking strategies (random masking, low-confidence masking, and semi-autoregressive strategies) that exploit its bidirectional attention and denoising capabilities. On LibriSpeech, the best cascade system achieves word error rates (WERs) of 2.25% on test-clean and 4.94% on test-other, a 12.3% relative improvement over the Whisper-LLaMA baseline on test-other. In contrast, a text-only LLaDA without acoustic features fails to improve accuracy, highlighting the importance of acoustically conditioned embeddings. The authors also evaluate Whisper-LLaDA as a standalone ASR decoder with diffusion-based and semi-autoregressive decoding; it achieves faster inference than the Whisper-LLaMA baseline in most experimental settings, but with slightly lower recognition accuracy.
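To make the deliberation idea concrete, here is a minimal, illustrative sketch of the low-confidence masking strategy: the lowest-confidence tokens in a Whisper-LLaMA hypothesis are replaced with a mask token so that the audio-conditioned diffusion LLM can re-predict them. All names here (mask_low_confidence, MASK, the toy example) are hypothetical and not taken from the paper's code.

```python
# Illustrative sketch only: hypothetical helper for the low-confidence
# masking strategy. The real system operates on Whisper-LLaMA token
# posteriors and feeds the masked sequence to LLaDA for denoising.
from typing import List, Tuple

MASK = "<MASK>"  # placeholder mask token


def mask_low_confidence(
    tokens: List[str],
    confidences: List[float],
    mask_ratio: float = 0.3,
) -> Tuple[List[str], List[int]]:
    """Replace the lowest-confidence tokens with MASK so that a diffusion
    LLM can re-predict them in a deliberation pass."""
    assert len(tokens) == len(confidences)
    n_mask = max(1, round(mask_ratio * len(tokens)))
    # Indices of the n_mask least confident tokens.
    masked_idx = sorted(range(len(tokens)), key=lambda i: confidences[i])[:n_mask]
    masked_set = set(masked_idx)
    masked = [MASK if i in masked_set else t for i, t in enumerate(tokens)]
    return masked, sorted(masked_idx)


if __name__ == "__main__":
    hyp = ["the", "cat", "sad", "on", "the", "mat"]   # toy ASR hypothesis
    conf = [0.98, 0.95, 0.41, 0.90, 0.97, 0.93]       # toy confidences
    masked, idx = mask_low_confidence(hyp, conf, mask_ratio=0.2)
    print(masked)  # ['the', 'cat', '<MASK>', 'on', 'the', 'mat']
    # The masked positions would then be filled by the audio-conditioned
    # diffusion LLM's denoising step.
```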

Takeaways, Limitations

Demonstrates the effectiveness of LLaDA as an external deliberation module over Whisper-LLaMA transcripts (WER reduction).
Highlights the value of exploiting bidirectional attention and denoising capabilities.
Confirms the importance of acoustically conditioned embeddings.
Demonstrates the potential of Whisper-LLaDA as a standalone decoder with diffusion-based and semi-autoregressive decoding (faster inference); see the decoding sketch after this list.
Recognition accuracy is slightly lower when Whisper-LLaDA is used as a standalone decoder.
The study is limited to a single dataset (LibriSpeech); further research is needed to determine whether the results generalize to other datasets.
The failure of text-only LLaDA to improve accuracy points to the need for further research on how to integrate acoustic information.
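The semi-autoregressive decoding mentioned above can be sketched as follows: blocks of tokens are committed left to right, and within each block masked positions are filled over a few parallel denoising iterations, keeping only the most confident predictions at each step. This is a schematic under assumed names (semi_autoregressive_decode, predict_fn), not the paper's implementation; the dummy predictor merely stands in for the audio-conditioned Whisper-LLaDA model.

```python
# Schematic of semi-autoregressive (block-wise) diffusion decoding.
# Hypothetical names throughout; `predict_fn` stands in for the
# audio-conditioned Whisper-LLaDA model.
from typing import Callable, List, Tuple

MASK = "<MASK>"


def semi_autoregressive_decode(
    predict_fn: Callable[[List[str]], Tuple[List[str], List[float]]],
    total_len: int,
    block_size: int = 4,
    steps_per_block: int = 2,
) -> List[str]:
    """Generate blocks left to right; within a block, fill masked positions
    over several parallel denoising steps, committing only the most
    confident predictions at each step."""
    seq = [MASK] * total_len
    for start in range(0, total_len, block_size):
        end = min(start + block_size, total_len)
        for step in range(steps_per_block):
            tokens, conf = predict_fn(seq)  # parallel prediction over the sequence
            masked = [i for i in range(start, end) if seq[i] == MASK]
            if not masked:
                break
            # Reveal a fraction of the remaining masked positions, most
            # confident first; the last step reveals everything left.
            k = max(1, len(masked) // (steps_per_block - step))
            for i in sorted(masked, key=lambda i: conf[i], reverse=True)[:k]:
                seq[i] = tokens[i]
    return seq


if __name__ == "__main__":
    target = "the cat sat on the mat".split()

    def dummy_predict(seq: List[str]) -> Tuple[List[str], List[float]]:
        # Dummy stand-in: always proposes the reference tokens with
        # arbitrary confidences, just to exercise the decoding loop.
        conf = [0.5 + 0.1 * (i % 3) for i in range(len(target))]
        return list(target), conf

    print(semi_autoregressive_decode(dummy_predict, len(target)))
```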