This paper focuses on the communicative behavior of the marmoset, a primate with a diverse and complex vocal repertoire. Unlike human speech, marmoset vocalizations are less structured and more variable, and they are typically recorded in noisy environments, which makes automated analysis difficult. To address these challenges, we pre-trained a Transformer model with Masked Autoencoders (MAE), a self-supervised learning method. The MAE-pretrained Transformer outperformed CNN baselines on marmoset vocalization segmentation, classification, and caller identification tasks. These results demonstrate the utility of self-supervised Transformer models for studying non-human communication in low-resource settings.
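As a rough illustration of the pretraining stage, the sketch below implements MAE-style masked reconstruction over flattened spectrogram patches in PyTorch. The class name `SpectrogramMAE`, the patch grid, the masking ratio, and all layer sizes are illustrative assumptions, not the configuration used in this work.

```python
import torch
import torch.nn as nn


class SpectrogramMAE(nn.Module):
    """Minimal MAE-style masked reconstruction over flattened spectrogram patches."""

    def __init__(self, patch_dim=256, num_patches=64, embed_dim=256,
                 nhead=8, depth=4, decoder_depth=2, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        enc_layer = nn.TransformerEncoderLayer(embed_dim, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        dec_layer = nn.TransformerEncoderLayer(embed_dim, nhead, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, decoder_depth)
        self.head = nn.Linear(embed_dim, patch_dim)  # predict raw patch values

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim), flattened spectrogram patches
        B, N, _ = patches.shape
        x = self.patch_embed(patches) + self.pos_embed

        # Randomly select a subset of patches to keep visible; mask the rest.
        num_keep = int(N * (1 - self.mask_ratio))
        noise = torch.rand(B, N, device=patches.device)
        ids_shuffle = noise.argsort(dim=1)
        ids_restore = ids_shuffle.argsort(dim=1)
        ids_keep = ids_shuffle[:, :num_keep]
        x_vis = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, x.size(-1)))

        # The encoder sees only the visible patches.
        latent = self.encoder(x_vis)

        # Append mask tokens, restore the original patch order, and decode.
        mask_tokens = self.mask_token.expand(B, N - num_keep, -1)
        full = torch.cat([latent, mask_tokens], dim=1)
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, full.size(-1)))
        pred = self.head(self.decoder(full + self.pos_embed))

        # Reconstruction loss is computed only on the masked patches.
        mask = torch.ones(B, N, device=patches.device)
        mask.scatter_(1, ids_keep, 0.0)
        return (((pred - patches) ** 2).mean(dim=-1) * mask).sum() / mask.sum()


# One pretraining step on a dummy batch of flattened 16x16 spectrogram patches.
model = SpectrogramMAE()
loss = model(torch.randn(2, 64, 256))
loss.backward()
```

In this scheme, only the visible patches pass through the encoder and the loss is computed solely on the masked patches, which is what makes MAE pretraining both self-supervised and computationally cheap; the resulting encoder can then be fine-tuned for the downstream segmentation, classification, and caller identification tasks.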