This study evaluates the performance of three advanced speech encoder models—Wav2Vec 2.0, XLS-R, and Whisper—on speaker identification tasks. We fine-tuned these models and analyzed their layer-by-layer representations using SVCCA, k-means clustering, and t-SNE visualization. We found that Wav2Vec 2.0 and XLS-R capture speaker-specific features predominantly in their early layers, and that fine-tuning improves both the stability of these representations and downstream performance. Whisper, by contrast, encodes speaker information more strongly in its deeper layers. We also determined, for each model, the optimal number of transformer layers to retain when fine-tuning for speaker identification.
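The layer-wise analysis described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: it scores each layer's utterance embeddings by how well k-means clusters recover the true speaker labels (via adjusted Rand index). The synthetic per-layer embeddings stand in for hidden states that, in practice, would be extracted from a model such as Wav2Vec 2.0; the helper name and the data-generation parameters are assumptions for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def layerwise_speaker_clustering(hidden_states, speaker_labels, n_speakers):
    """Score how well each layer's embeddings separate speakers.

    hidden_states: list of (n_utterances, dim) arrays, one per encoder layer
    speaker_labels: (n_utterances,) array of true speaker ids
    Returns one adjusted Rand index per layer (higher = better separation).
    """
    scores = []
    for layer_emb in hidden_states:
        pred = KMeans(n_clusters=n_speakers, n_init=10,
                      random_state=0).fit_predict(layer_emb)
        scores.append(adjusted_rand_score(speaker_labels, pred))
    return scores

# Synthetic stand-in for per-layer utterance embeddings (hypothetical data):
rng = np.random.default_rng(0)
n_speakers, per_spk, dim, n_layers = 4, 20, 32, 6
labels = np.repeat(np.arange(n_speakers), per_spk)
layers = []
for l in range(n_layers):
    # Make early layers more speaker-separable, mimicking the Wav2Vec 2.0 /
    # XLS-R pattern reported in the study.
    spread = 3.0 if l < n_layers // 2 else 0.5
    centers = rng.normal(0.0, spread, (n_speakers, dim))
    layers.append(centers[labels] + rng.normal(0.0, 1.0, (len(labels), dim)))

scores = layerwise_speaker_clustering(layers, labels, n_speakers)
best_layer = int(np.argmax(scores))
```

In a real experiment, `hidden_states` would come from the model's per-layer outputs on a held-out utterance set, and the same per-layer scoring could be repeated with SVCCA similarity or t-SNE projections instead of clustering.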