Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification

Created by
  • Haebom

Authors

Pengyu Wang, Ying Fang, Xiaofei Li

Outline

This paper proposes Variational Inference with Neural Speech Prior (VINP), a method that jointly estimates anechoic speech and the room impulse response (RIR) from single-channel reverberant speech. VINP builds a probabilistic signal model in the time-frequency domain and, within a variational Bayesian inference (VBI) framework, uses a neural network to estimate the prior distribution of the anechoic speech. Unlike conventional single-channel dereverberation methods, VINP is effective as a front end for automatic speech recognition (ASR) systems. It obtains a maximum a posteriori (MAP) estimate of the anechoic speech and a maximum likelihood (ML) estimate of the RIR, from which the waveforms are recovered. Experimental results show state-of-the-art performance in Mean Opinion Score (MOS) and Word Error Rate (WER), as well as superior performance in blind estimation of the 60 dB reverberation time (RT60) and the direct-to-reverberant ratio (DRR). Code and audio samples are available online.
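RT60 and DRR are both quantities that can be read off an RIR, which is why an estimated RIR can be scored on them. The sketch below is not the paper's evaluation code. It is a minimal illustration, assuming a toy synthetic RIR (a direct-path impulse followed by exponentially decaying noise), Schroeder backward integration with a line fit between -5 dB and -25 dB for RT60, and a 2.5 ms window around the strongest tap for the direct path in DRR. The function names, the fit range, and the window length are illustrative assumptions.

```python
import math
import random


def synthetic_rir(rt60, fs=16000, length_s=1.0, direct_gain=1.0, seed=0):
    """Toy RIR (assumption): direct-path impulse plus exponentially decaying noise."""
    rng = random.Random(seed)
    n = int(fs * length_s)
    # Amplitude decay rate chosen so the energy falls by 60 dB after rt60 seconds.
    decay = 3.0 * math.log(10.0) / rt60
    h = [0.1 * rng.gauss(0.0, 1.0) * math.exp(-decay * i / fs) for i in range(n)]
    h[0] = direct_gain  # direct path at time zero
    return h


def estimate_rt60(h, fs, hi_db=-5.0, lo_db=-25.0):
    """RT60 from the Schroeder energy decay curve, extrapolated from a line fit."""
    # Backward integration: edc[i] is the energy remaining from sample i onward.
    edc = [0.0] * len(h)
    acc = 0.0
    for i in range(len(h) - 1, -1, -1):
        acc += h[i] * h[i]
        edc[i] = acc
    total = edc[0]
    # Collect the (time, level in dB) points that fall inside the fit range.
    pts = []
    for i, e in enumerate(edc):
        if e <= 0.0:
            break
        level = 10.0 * math.log10(e / total)
        if lo_db <= level <= hi_db:
            pts.append((i / fs, level))
    # Least-squares slope of the decay in dB per second.
    mean_t = sum(t for t, _ in pts) / len(pts)
    mean_d = sum(d for _, d in pts) / len(pts)
    num = sum((t - mean_t) * (d - mean_d) for t, d in pts)
    den = sum((t - mean_t) ** 2 for t, _ in pts)
    slope = num / den
    return -60.0 / slope


def estimate_drr(h, fs, window_ms=2.5):
    """DRR in dB: energy near the strongest tap versus energy everywhere else."""
    peak = max(range(len(h)), key=lambda i: abs(h[i]))
    w = int(fs * window_ms / 1000.0)
    lo, hi = max(0, peak - w), min(len(h), peak + w + 1)
    direct = sum(x * x for x in h[lo:hi])
    reverb = sum(x * x for x in h[:lo]) + sum(x * x for x in h[hi:])
    return 10.0 * math.log10(direct / reverb)


if __name__ == "__main__":
    fs = 16000
    rir = synthetic_rir(rt60=0.5, fs=fs)
    print("RT60 estimate (s):", round(estimate_rt60(rir, fs), 3))
    print("DRR estimate (dB):", round(estimate_drr(rir, fs), 2))
```

In a blind setting such as the one the paper addresses, the true RIR is unavailable, so measurements like these would be applied to the estimated RIR and compared against the values obtained from the ground-truth RIR.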

Takeaways, Limitations

Takeaways:
By combining variational Bayesian inference with a neural network-based speech prior, VINP jointly addresses single-channel speech dereverberation and blind RIR identification.
It achieves state-of-the-art dereverberation performance that carries over directly to automatic speech recognition systems.
It also performs well in blind estimation of RT60 and DRR.
The code and audio samples are publicly available, which supports reproducibility.
Limitations:
The paper does not explicitly discuss its limitations or future research directions.
Further analysis is needed to determine how well the performance generalizes to other acoustic environments and speech data.
The neural network architecture and hyperparameters may not be described in enough detail.