This paper proposes a novel method for unsupervised document clustering using multimodal embeddings that combine text, layout, and visual features. Beyond simple document type classification (e.g., invoices, purchase orders), we aim for more granular document understanding by distinguishing different templates within the same document type. We evaluate embeddings generated by state-of-the-art multimodal pre-trained models, including SBERT, LayoutLMv1, LayoutLMv3, DiT, Donut, ColPali, Gemma3, and InternVL3, by applying them to clustering algorithms such as $k$-Means, DBSCAN, HDBSCAN with $k$-NN, and BIRCH. Experimental results demonstrate that multimodal embeddings can improve document clustering performance, suggesting their applicability to diverse tasks, including intelligent document processing, document layout analysis, and unsupervised document classification. Furthermore, we analyze the strengths and weaknesses of the evaluated multimodal models and suggest directions for future research.
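As a rough illustration of the embed-then-cluster pipeline summarized above, the sketch below encodes document texts with an SBERT model and clusters the resulting embeddings with $k$-Means. The checkpoint name, example documents, and cluster count are placeholder assumptions for illustration, not the configuration used in the experiments.

```python
# Minimal sketch of the embed-then-cluster pipeline (illustrative only).
# The texts stand in for document contents; the SBERT checkpoint and the
# cluster count are assumed placeholders, not the paper's configuration.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

texts = [
    "Invoice #1042, ACME Corp., total due: $1,200.00",
    "Invoice 2024-17, billed to Globex Inc., amount: $98.50",
    "Purchase Order PO-553: 40 units of part X-100",
    "Purchase Order PO-601: 12 units of part Z-7",
]

# Encode each document into a fixed-size embedding vector.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(texts, normalize_embeddings=True)

# Cluster the embeddings; the number of clusters is assumed known here.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

print("cluster labels:", labels)
print("silhouette score:", silhouette_score(embeddings, labels))
```

The same embeddings could instead be passed to density-based algorithms such as DBSCAN or HDBSCAN, which do not require the number of clusters in advance.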