Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Categorical Data Clustering via Value Order Estimated Distance Metric Learning

Created by
  • Haebom

Authors

Yiqun Zhang, Mingjie Zhao, Hong Jia, Yang Lu, Mengke Li, Yiu-ming Cheung

Outline

This paper proposes a novel distance measure for clustering categorical data. Unlike numerical data, categorical data have no well-defined metric space (such as Euclidean space), so conventional distance measures can lose information during clustering. To address this, the paper presents an order distance metric that learns the optimal order relationship among the values of each categorical attribute and quantifies their distances in a linear space, as is done for numerical attributes. Because subjectively defined categorical values are often ambiguous and fuzzy, the authors develop a joint learning paradigm in which the order distance metric is learned together with the clustering itself. The method has low time complexity and guaranteed convergence, and it achieves higher clustering accuracy than existing methods on categorical and mixed datasets. The learned order distance metric also makes non-intuitive categorical data easier to understand and manage. The effectiveness of the method was verified through extensive experiments, and the source code has been made available.
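To make the idea of an order-based distance concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function name `ordinal_distance`, the example attribute values, and the choice of equally spaced positions on [0, 1] are all assumptions made for illustration. In the paper the order is learned jointly with the clustering, whereas here it is simply given.

```python
def ordinal_distance(order, a, b):
    """Distance between two categorical values under an assumed value order.

    Hypothetical helper: values are placed at equally spaced positions
    on [0, 1] according to `order`, and the distance is the gap between
    the two positions. Equal spacing is an illustrative assumption.
    """
    if len(order) < 2:
        return 0.0
    position = {value: index for index, value in enumerate(order)}
    return abs(position[a] - position[b]) / (len(order) - 1)


# Assumed order for a hypothetical attribute with three values.
order = ["low", "medium", "high"]

d_adjacent = ordinal_distance(order, "low", "medium")  # neighbouring values
d_extreme = ordinal_distance(order, "low", "high")     # opposite ends
d_same = ordinal_distance(order, "high", "high")       # identical values
```

Under a plain match/mismatch (Hamming-style) distance, "low" would be equally far from "medium" and from "high". With an order in place, `d_adjacent` is smaller than `d_extreme`, which is the kind of information an order-based distance is meant to preserve.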

Takeaways, Limitations

Takeaways:
Improved clustering performance on categorical data: the method achieves better clustering accuracy than existing methods.
Improved understanding and management of categorical data: the learned order distance metric makes categorical data easier to interpret and use.
An efficient joint learning paradigm: the distance metric is learned together with the clustering, with low time complexity and guaranteed convergence.
Open source code: the released code improves reproducibility and extensibility.
Limitations:
Further research is needed to evaluate how well the method generalizes, including extended experiments on more datasets and clustering algorithms.
Efficiency on high-dimensional categorical data still needs to be verified.
Research is needed on how to choose the optimal parameter settings for a given dataset.