PAPER DIGEST
Most Influential CIKM 2002 Paper · 2026-03 edition

COOLCAT: An Entropy-based Algorithm For Categorical Clustering

Daniel Barbará; Yi Li; Julia Couto

Venue
ACM Conference on Information and Knowledge Management (CIKM) 2002
Recognition
Most Influential CIKM 2002 Paper (Rank No. 3)
Edition
2026-03
Impact factor
6
Certificate ID
f3aef1b491671583

Abstract

In this paper we explore the connection between clustering categorical data and entropy: clusters of similar points have lower entropy than those of dissimilar ones. We use this connection to design an incremental heuristic algorithm, COOLCAT, which is capable of efficiently clustering large data sets of records with categorical attributes, as well as data streams. In contrast with other categorical clustering algorithms published in the past, COOLCAT's clustering results are very stable for different sample sizes and parameter settings. Also, the criterion for clustering is a very intuitive one, since it is deeply rooted in the well-known notion of entropy. Most importantly, COOLCAT is well equipped to deal with clustering of data streams (continuously arriving streams of data points), since it is an incremental algorithm capable of clustering new points without having to look at every point that has been clustered so far. We demonstrate the efficiency and scalability of COOLCAT by a series of experiments on real and synthetic data sets.
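
The entropy criterion and the incremental placement step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that attributes are treated as independent (so a cluster's entropy is the sum of its per-attribute Shannon entropies) and that the quantity to minimize is the size-weighted average entropy over clusters. The names `ClusterSummary` and `assign` are hypothetical. Each cluster keeps only value counts, so a new point is placed without revisiting earlier points.

```python
from collections import Counter
from math import log2


class ClusterSummary:
    """Per-attribute value counts for one cluster; raw points are not stored."""

    def __init__(self, num_attributes):
        self.size = 0
        self.counts = [Counter() for _ in range(num_attributes)]

    def add(self, point):
        """Fold one categorical record into the counts."""
        self.size += 1
        for counter, value in zip(self.counts, point):
            counter[value] += 1

    def entropy(self, extra=None):
        """Sum of per-attribute entropies, optionally as if `extra` were added.

        Assumption: attributes are independent, so entropies simply add up.
        """
        n = self.size + (1 if extra is not None else 0)
        if n == 0:
            return 0.0
        total = 0.0
        for i, counter in enumerate(self.counts):
            tallies = dict(counter)
            if extra is not None:
                tallies[extra[i]] = tallies.get(extra[i], 0) + 1
            total -= sum((c / n) * log2(c / n) for c in tallies.values())
        return total


def assign(clusters, point):
    """Add `point` to the cluster that gives the lowest size-weighted entropy.

    Returns the index of the chosen cluster.
    """
    total = sum(c.size for c in clusters) + 1

    def score(target_index):
        weighted = 0.0
        for i, cluster in enumerate(clusters):
            if i == target_index:
                weighted += (cluster.size + 1) / total * cluster.entropy(point)
            else:
                weighted += cluster.size / total * cluster.entropy()
        return weighted

    best = min(range(len(clusters)), key=score)
    clusters[best].add(point)
    return best
```

For example, with one cluster seeded with two `("red", "small")` records and another with two `("blue", "large")` records, calling `assign` on a new `("red", "small")` record places it in the first cluster, since that choice leaves both clusters perfectly homogeneous.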
