Daily Arxiv

This page collects and organizes artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Evaluating the Effectiveness of Transformer Layers in Wav2Vec 2.0, XLS-R, and Whisper for Speaker Identification Tasks

Created by
  • Haebom

Author

Linus Stuhlmann, Michael Alexander Saxer

Outline

This study evaluates the performance of three advanced speech encoder models—Wav2Vec 2.0, XLS-R, and Whisper—on speaker identification tasks. The authors fine-tuned these models and analyzed their layer-by-layer representations using SVCCA, k-means clustering, and t-SNE visualization. They found that Wav2Vec 2.0 and XLS-R capture speaker-specific features effectively in their early layers, and that fine-tuning improves both stability and performance. Whisper, by contrast, represents speaker information more effectively in its deeper layers. The study also determines the optimal number of transformer layers to retain in each model when fine-tuning for speaker identification.
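The layer-wise analysis described in the paper—extracting per-layer embeddings and checking how well they separate speakers via k-means clustering—can be sketched roughly as follows. This is an illustrative sketch only: the data is synthetic, and the `kmeans` and `cluster_purity` helpers are hypothetical stand-ins, not the authors' code or any library API.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means with random initialization (illustrative only)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recompute each non-empty cluster's center.
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign

def cluster_purity(assign, labels):
    """Fraction of points whose cluster's majority speaker matches their own."""
    correct = 0
    for c in np.unique(assign):
        members = labels[assign == c]
        correct += np.bincount(members).max()
    return correct / len(labels)

# Hypothetical stand-in for one layer's mean-pooled encoder outputs:
# 3 speakers, 30 utterances each, 16-dimensional embeddings.
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(3), 30)
X = rng.normal(size=(90, 16)) + 4.0 * rng.normal(size=(3, 16))[labels]

assign = kmeans(X, k=3)
print(f"cluster purity: {cluster_purity(assign, labels):.2f}")
```

In the paper's setting, `X` would instead come from one transformer layer's hidden states for a batch of utterances; repeating this per layer shows where speaker-discriminative structure emerges (early layers for Wav2Vec 2.0 and XLS-R, deeper layers for Whisper).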

Takeaways, Limitations

  • Wav2Vec 2.0 and XLS-R capture speaker-specific features effectively in their early layers.
  • Fine-tuning improves the stability and performance of the models.
  • Whisper represents speaker information more effectively in its deeper layers.
  • The optimal number of transformer layers for speaker identification was determined for each model.