This paper presents a method for training a Conformer-based encoder that generates discriminative embeddings for short audio segments using a self-supervised contrastive learning framework. By leveraging the Conformer's ability to capture both local and global interactions, we achieve state-of-the-art performance on audio retrieval tasks, generating embeddings from only 3 seconds of audio. Furthermore, we maintain this state-of-the-art performance while remaining highly robust to temporal misalignment and other audio distortions, such as noise, reverberation, and extreme time stretching. We train and evaluate our model on publicly available datasets of various sizes, and we release the code and model publicly to ensure reproducibility of our results.