Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

HERCULES: Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization

Created by
  • Haebom

Authors

Gabor Petnehazi, Bernadett Aradi

Outline

HERCULES is an algorithm and Python package for hierarchical k-means clustering of text, image, and numeric datasets, with a Large Language Model (LLM) generating a semantically rich description for each cluster. Starting from the individual data points, it applies k-means recursively to build a hierarchy of clusters. It supports two representation modes for this clustering: 'direct' mode, which uses the original data embeddings or scaled numeric features, and 'description' mode, which uses embeddings of the LLM-generated summaries. Users can supply a topic_seed to steer the LLM-generated summaries toward specific topics, and can explore the clustering results through interactive visualization tools.
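The recursive clustering described above can be illustrated with a minimal sketch. It is not the HERCULES package's API: the names `kmeans`, `build_hierarchy`, `stub_describe`, and `level_sizes` are hypothetical, the k-means is a plain-Python Lloyd's loop, and the LLM call is replaced by a stub. It assumes a 'direct'-style representation in which each cluster is represented by its centroid, so the next level clusters the centroids of the level below.

```python
import random

def _nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means over lists of floats; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = [_nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment so the labels match the returned centroids.
    labels = [_nearest(p, centroids) for p in points]
    return labels, centroids

def stub_describe(level, children):
    """Hypothetical stand-in for an LLM-generated cluster description."""
    return f"level-{level} cluster with {len(children)} members"

def build_hierarchy(points, level_sizes, describe=stub_describe):
    """Cluster points, then cluster the resulting centroids, one level per entry.

    `level_sizes` (an assumption of this sketch) lists k for each level,
    e.g. [4, 2] means 4 clusters of points, then 2 clusters of those clusters.
    """
    levels, current = [], points
    for level, k in enumerate(level_sizes):
        k = min(k, len(current))
        labels, centroids = kmeans(current, k)
        clusters = []
        for c in range(k):
            # Children are indices into the level below (points at level 0).
            children = [i for i, lab in enumerate(labels) if lab == c]
            clusters.append({
                "children": children,
                "centroid": centroids[c],
                "description": describe(level, children),
            })
        levels.append(clusters)
        current = centroids  # 'direct'-style: the next level clusters centroids
    return levels

points = [[0.0, 0.0], [0.2, 0.1], [0.1, 9.9], [0.0, 10.1],
          [100.0, 0.0], [100.1, 0.2], [99.9, 10.0], [100.0, 10.2]]
levels = build_hierarchy(points, level_sizes=[4, 2])
```

In a 'description'-style variant, the line that sets `current` would instead embed each cluster's generated description and cluster those embeddings; that step is omitted here because it depends on an embedding model and an LLM.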

Takeaways, Limitations

Takeaways:
Provides hierarchical clustering for text, image, and numeric data.
Uses an LLM to improve the semantic understanding of clusters.
Lets users steer results toward specific topics through topic_seed.
Provides interactive visualization tools for analyzing and understanding the results.
Demonstrates the possibility of extracting hierarchical knowledge from complex datasets.
Limitations:
Only one modality can be processed at a time; multiple modalities cannot be clustered together.
Results depend on the LLM: a weaker LLM may reduce the accuracy and interpretability of the cluster descriptions.
The effectiveness of topic_seed may depend on the user's expertise.
Further research is needed to determine optimal parameters for hierarchical clustering.