Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Mamba2 Meets Silence: Robust Vocal Source Separation for Sparse Regions

Created by
  • Haebom

Author

Euiyeon Kim, Yong-Hoon Choi

Outline

This paper presents a source separation model specialized for accurate vocal separation. To overcome the difficulty existing Transformer-based models have in capturing intermittent vocals, the authors leverage Mamba2, a state-of-the-art state-space model that better captures long-range temporal dependencies. To process long input sequences efficiently, the model combines a band-splitting strategy with a dual-path architecture. Experiments show that the proposed model outperforms current state-of-the-art models, achieving a best-in-class cSDR of 11.03 dB along with significant gains in uSDR. It also maintains stable, consistent performance across a wide range of input lengths and vocal occurrence patterns. These results confirm the effectiveness of Mamba-based models for high-resolution audio processing and suggest new directions for broader applications in audio research.
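The band-splitting and dual-path data flow described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the band layout, embedding size, and the placeholder sequence model (standing in for a Mamba2 layer) are all assumptions.

```python
import numpy as np

# Toy magnitude spectrogram: 64 frequency bins x 100 time frames.
rng = np.random.default_rng(0)
spec = rng.random((64, 100))

# --- Band splitting: group frequency bins into bands, embed each band.
n_bands, band_size = 8, 8                        # assumed: 8 uniform bands of 8 bins
bands = spec.reshape(n_bands, band_size, 100)    # (bands, bins, time)
W = rng.standard_normal((band_size, 16)) * 0.1   # assumed per-band embedding, dim 16
x = np.einsum('bft,fd->btd', bands, W)           # (bands, time, dim)

def seq_model(seq):
    """Placeholder for a Mamba2 (or any sequence) layer: a causal
    cumulative average, standing in for a state-space recurrence."""
    csum = np.cumsum(seq, axis=0)
    steps = np.arange(1, seq.shape[0] + 1)[:, None]
    return csum / steps

# --- Dual-path block: alternate modeling along time and across bands.
# Intra path: run the sequence model over TIME within each band.
x = np.stack([seq_model(x[b]) for b in range(n_bands)])            # (bands, time, dim)
# Inter path: run the sequence model over BANDS at each time frame.
x = np.stack([seq_model(x[:, t]) for t in range(x.shape[1])], axis=1)

print(x.shape)  # (8, 100, 16): same layout, now with time and band context mixed in
```

The point of the dual path is that each 1-D pass stays short (time frames within a band, or bands at one frame), which is what makes long high-resolution inputs tractable.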

Takeaways, Limitations

Takeaways:
Leveraging a Mamba2-based model, we overcome the limitations of existing Transformer-based models and significantly improve vocal separation performance (a cSDR of 11.03 dB).
We propose a method to efficiently process long input sequences using a band-splitting strategy and a dual-path architecture.
Its stable performance across a wide range of input lengths and vocal occurrence patterns enhances its potential for practical applications.
We demonstrate the utility of Mamba-based models in high-resolution audio processing.
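For reference, the uSDR cited above is commonly computed as a plain signal-to-distortion ratio over the whole track, while cSDR aggregates SDR over fixed-length chunks. A rough sketch of that metric follows; the 1-second chunk length and median aggregation are assumptions based on common music-separation evaluation practice, not details taken from this paper.

```python
import numpy as np

def sdr(ref, est, eps=1e-9):
    """Signal-to-distortion ratio in dB: 10*log10(||ref||^2 / ||ref - est||^2)."""
    return 10 * np.log10((np.sum(ref**2) + eps) / (np.sum((ref - est)**2) + eps))

def chunked_sdr(ref, est, sr=44100, chunk_sec=1.0):
    """cSDR-style score: median SDR over fixed-length chunks (assumed 1 s)."""
    n = int(sr * chunk_sec)
    scores = [sdr(ref[i:i + n], est[i:i + n])
              for i in range(0, len(ref) - n + 1, n)]
    return float(np.median(scores))

# An estimate close to the reference scores high; heavy distortion scores low.
rng = np.random.default_rng(1)
ref = rng.standard_normal(44100 * 3)          # 3 s of toy "vocals"
good = ref + 0.01 * rng.standard_normal(len(ref))
print(round(chunked_sdr(ref, good)))          # 40 (dB)
```

Chunk-wise aggregation makes the score sensitive to sparse vocal regions: a model that collapses during silent or intermittent passages is penalized on those chunks, which is exactly the regime the paper targets.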
Limitations:
This paper does not provide a detailed explanation of the specific implementation of the Mamba2 model or hyperparameter tuning.
Performance evaluations for other types of sound source separation (e.g., instrument separation) were not presented.
Further analysis is needed of generalization performance on datasets beyond the real-world music datasets used for evaluation.