Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Unsupervised Document and Template Clustering using Multimodal Embeddings

Created by
  • Haebom

Authors

Phillipe R. Sampaio, Helene Maxcici

Outline

This paper proposes a novel method for unsupervised document clustering based on multimodal embeddings that combine text, layout information, and visual features. Beyond coarse document-type classification (e.g., invoices vs. purchase orders), it aims for more granular document understanding by distinguishing different templates within the same document type. The authors generate embeddings with state-of-the-art pre-trained multimodal models, including SBERT, LayoutLMv1, LayoutLMv3, DiT, Donut, ColPali, Gemma3, and InternVL3, and evaluate them with clustering algorithms such as $k$-Means, DBSCAN, HDBSCAN combined with $k$-NN, and BIRCH. The experimental results indicate that multimodal embeddings can improve document clustering performance, which makes them promising for applications such as intelligent document processing, document layout analysis, and unsupervised document classification. The paper also analyzes the strengths and weaknesses of the individual models and suggests directions for future research.
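The pipeline described above (embed each document, then cluster the embeddings) can be illustrated with a small sketch. This is not the authors' code: the vectors are synthetic stand-ins for real model embeddings, only a plain $k$-Means is shown, and the helper names (`make_embeddings`, `kmeans`, `purity`) as well as the choice of purity as the score are assumptions made for illustration. The example simulates two document types with two templates each and compares clustering at the type level with clustering at the template level.

```python
import math
import random


def make_embeddings(centers, per_cluster, noise, seed=0):
    """Draw synthetic vectors around each center (stand-ins for document embeddings)."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, center in enumerate(centers):
        for _ in range(per_cluster):
            points.append([c + rng.gauss(0.0, noise) for c in center])
            labels.append(label)
    return points, labels


def kmeans(points, k, iters=50):
    """Lloyd's k-means with a deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest already-chosen centroid.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    assign = None
    for _ in range(iters):
        new_assign = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_assign == assign:  # converged: assignments stopped changing
            break
        assign = new_assign
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def purity(assign, labels):
    """Share of points that carry the majority label of their cluster."""
    total = 0
    for cluster in set(assign):
        members = [l for a, l in zip(assign, labels) if a == cluster]
        total += max(members.count(l) for l in set(members))
    return total / len(labels)


# Hypothetical layout: two document types, each with two templates.
template_centers = [
    (0.0, 0.0, 0.0),   # invoice, template A
    (0.0, 4.0, 0.0),   # invoice, template B
    (10.0, 0.0, 0.0),  # purchase order, template A
    (10.0, 4.0, 0.0),  # purchase order, template B
]
points, template_labels = make_embeddings(template_centers, per_cluster=20, noise=0.2)
type_labels = [t // 2 for t in template_labels]  # templates 0,1 -> type 0; 2,3 -> type 1

type_clusters = kmeans(points, k=2)      # coarse: document type
template_clusters = kmeans(points, k=4)  # granular: template within type

print("type-level purity:", purity(type_clusters, type_labels))
print("template-level purity (k=4):", purity(template_clusters, template_labels))
print("template-level purity (k=2):", purity(type_clusters, template_labels))
```

With only two clusters, each cluster necessarily mixes two templates, so the template-level score drops even though the document types are separated cleanly. That gap is the reason template-level clustering is treated as a harder, more granular task than document-type classification.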

Takeaways, Limitations

Takeaways:
Demonstrates the effectiveness of multimodal embeddings for unsupervised document clustering.
Presents a novel approach to granular document understanding and classification.
Provides guidelines for model selection through a comparative performance analysis of multiple multimodal models.
Points to potential applications in areas such as intelligent document processing, document layout analysis, and unsupervised document classification.
Limitations:
The types and performance of the multimodal models used need further analysis.
The results may be biased toward certain types of documents or layouts.
Generalization performance in real-world applications still needs to be evaluated.
Scalability to large document datasets still needs to be verified.