PAPER DIGEST
Most Influential SIGIR 2002 Paper · 2026-03 edition

Unsupervised Document Classification Using Sequential Information Maximization

Noam Slonim; Nir Friedman; Naftali Tishby

Venue
ACM SIGIR Conference (SIGIR) 2002
Recognition
Most Influential SIGIR 2002 Paper (Rank No. 8)
Edition
2026-03
Impact factor
6
Certificate ID
303a540a2a535515

Abstract

We present a novel sequential clustering algorithm which is motivated by the Information Bottleneck (IB) method. In contrast to the agglomerative IB algorithm, the new sequential (sIB) approach is guaranteed to converge to a local maximum of the information, as required by the original IB principle. Moreover, the time and space complexity are significantly improved, typically linear in the data size. We apply this algorithm to unsupervised document classification. In our evaluation, on small and medium-sized corpora, the sIB is found to be consistently superior to all the other clustering methods we examine, typically by a significant margin. Moreover, the sIB results are comparable to those obtained by a supervised Naive Bayes classifier. Finally, we propose a simple procedure for trading a cluster's recall for higher precision, and show how this approach can extract clusters which match the existing topics of the corpus almost perfectly.
