PAPER DIGEST
Most Influential SIGMOD 1996 Paper · 2026-03 edition

BIRCH: An Efficient Data Clustering Method For Very Large Databases

Tian Zhang; Raghu Ramakrishnan; Miron Livny

Venue: ACM SIGMOD Conference (SIGMOD) 1996
Recognition: Most Influential SIGMOD 1996 Paper (Rank No. 1)
Edition: 2026-03
Impact factor: 9
Certificate ID: 2c90a603cf042655

Abstract

Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs. This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparison of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.
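
To make the single-scan, incremental idea in the abstract concrete, here is a minimal Python sketch. It is not the paper's algorithm: the abstract does not describe the data structure, and the sketch leaves out the balanced hierarchy, memory-driven rebuilding, noise handling, and the additional refinement scans. It assumes each group of points is kept as a compact additive summary (count, per-dimension sum, sum of squared norms), in the spirit of the clustering-feature summaries BIRCH is known for, and that a point joins its nearest group only if the group's radius stays within a threshold. The names `Summary`, `cluster_stream`, and `threshold` are hypothetical and chosen for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Summary:
    """Additive summary of a group of points (hypothetical name).

    Stores only the count, the per-dimension linear sum, and the sum of
    squared norms, so raw points never need to be kept in memory.
    """
    n: int = 0
    ls: list = field(default_factory=list)
    ss: float = 0.0

    def added(self, point):
        """Return a new summary that also covers `point`; self is unchanged."""
        ls = list(point) if self.n == 0 else [a + b for a, b in zip(self.ls, point)]
        return Summary(self.n + 1, ls, self.ss + sum(x * x for x in point))

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        """Root-mean-square distance of the summarized points from the centroid."""
        c2 = sum(x * x for x in self.centroid())
        # max(..., 0.0) guards against tiny negative values from rounding.
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


def cluster_stream(points, threshold):
    """One pass over `points`: absorb each point into the nearest summary
    if the resulting radius stays within `threshold`, else start a new one.

    Simplification (assumption): a flat list of summaries with a linear
    nearest-centroid search, not a balanced tree.
    """
    summaries = []
    for p in points:
        if summaries:
            i = min(range(len(summaries)),
                    key=lambda j: math.dist(summaries[j].centroid(), p))
            candidate = summaries[i].added(p)
            if candidate.radius() <= threshold:
                summaries[i] = candidate
                continue
        summaries.append(Summary().added(p))
    return summaries


if __name__ == "__main__":
    stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
    for s in cluster_stream(stream, threshold=0.5):
        print(s.n, s.centroid(), round(s.radius(), 3))
```

Because each summary is updated by simple addition, every point is read exactly once and then discarded, which is what makes a single scan feasible when the data does not fit in memory. The result of this greedy pass depends on the order in which points arrive, which relates to the input-order sensitivity the abstract says the authors evaluate.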
