In this paper, we present CaLLiPer, a novel multimodal representation learning framework for predicting an individual's next location. To address the limitations of existing methods, namely the lack of explicit spatial information, the difficulty of integrating rich urban semantic context, and the inability to handle previously unseen locations, CaLLiPer adopts a contrastive learning approach that fuses spatial coordinates with the semantic features of points of interest (POIs). Experimental results show that CaLLiPer outperforms existing methods, particularly in scenarios where previously unseen locations appear. To encourage reproducibility and follow-up research, we make our code and data publicly available.
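To make the core idea concrete, the following is a minimal sketch of contrastive alignment between a coordinate encoder and POI semantic embeddings. It is illustrative only, not the paper's exact implementation: the sinusoidal-plus-MLP `LocationEncoder`, the embedding sizes, and the temperature value of 0.07 are all assumptions, and `sem_emb` stands in for the output of a pretrained text encoder over POI descriptions.

```python
# Minimal sketch (assumed design): CLIP-style contrastive alignment of
# location coordinates and POI semantic embeddings. Each training sample
# pairs a location's (x, y) coordinates with an embedding of its POI
# semantic description.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationEncoder(nn.Module):
    """Maps 2-D coordinates to an embedding via sinusoidal features + MLP
    (a common positional-encoding choice; hypothetical, not the paper's exact one)."""
    def __init__(self, num_freqs: int = 16, dim: int = 128):
        super().__init__()
        # Geometric series of frequencies for multi-scale positional encoding.
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs))
        self.mlp = nn.Sequential(
            nn.Linear(4 * num_freqs, 256), nn.ReLU(), nn.Linear(256, dim)
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:  # coords: (B, 2)
        # (B, 2, F) scaled coordinates -> sin/cos features -> flattened (B, 4F).
        scaled = coords.unsqueeze(-1) * self.freqs
        feats = torch.cat([scaled.sin(), scaled.cos()], dim=-1).flatten(1)
        return self.mlp(feats)

def contrastive_loss(loc_emb, sem_emb, temperature: float = 0.07):
    """Symmetric InfoNCE: matched (location, POI-semantics) pairs attract,
    mismatched pairs within the batch repel."""
    loc = F.normalize(loc_emb, dim=-1)
    sem = F.normalize(sem_emb, dim=-1)
    logits = loc @ sem.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(len(loc), device=loc.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Usage: in practice sem_emb would come from a text encoder over POI
# descriptions; random tensors are placeholders here.
encoder = LocationEncoder()
coords = torch.rand(32, 2)        # normalised (x, y) coordinates
sem_emb = torch.randn(32, 128)    # placeholder POI semantic embeddings
loss = contrastive_loss(encoder(coords), sem_emb)
loss.backward()
```

Because the location encoder operates on continuous coordinates rather than a fixed location vocabulary, a scheme of this kind can embed a previously unseen location at inference time, which is consistent with the paper's claim of improved handling of new locations.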