PAPER DIGEST
Most Influential SIGMOD 2004 Paper · 2026-03 edition

Efficient Set Joins On Similarity Predicates

Sunita Sarawagi; Alok Kirpal

Venue: ACM SIGMOD Conference (SIGMOD) 2004
Recognition: Most Influential SIGMOD 2004 Paper (Rank No. 8)
Edition: 2026-03
Impact factor: 6
Certificate ID: deb3f487842e3831

Abstract

In this paper we present an efficient, scalable and general algorithm for performing set joins on predicates involving various similarity measures like intersect size, Jaccard-coefficient, cosine similarity, and edit-distance. This expands the existing suite of algorithms for set joins on simpler predicates such as set containment, equality and non-zero overlap. We start with a basic inverted-index-based probing method and add a sequence of optimizations that result in one to two orders of magnitude improvement in running time. The algorithm folds in a data partitioning strategy that can work efficiently with an index compressed to fit in any available amount of main memory. The optimizations used in our algorithm generalize to several weighted and unweighted measures of partial word overlap between sets.
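
As a rough illustration of the kind of baseline the abstract starts from, the following is a minimal sketch of an inverted-index probing self-join on an intersect-size predicate. It is not the paper's algorithm and includes none of its optimizations, partitioning, or index compression. The function name `overlap_self_join` and the parameter `threshold` are hypothetical, chosen here for illustration only.

```python
from collections import defaultdict


def overlap_self_join(sets, threshold):
    """Return pairs (i, j), i < j, whose sets share at least `threshold` words.

    Illustrative baseline only: each set probes an inverted index built
    from the sets seen so far, then is added to the index itself.
    """
    index = defaultdict(list)  # word -> ids of earlier sets containing it
    result = []
    for rid, words in enumerate(sets):
        words = set(words)  # ignore duplicate words within a record
        counts = defaultdict(int)  # earlier set id -> number of shared words
        for w in words:
            # .get avoids creating empty index entries while probing
            for other in index.get(w, ()):
                counts[other] += 1
        for other, shared in sorted(counts.items()):
            if shared >= threshold:
                result.append((other, rid))
        for w in words:
            index[w].append(rid)
    return result


records = [
    {"a", "b", "c"},
    {"b", "c", "d"},
    {"x", "y"},
    {"a", "b", "c", "d"},
]
pairs = overlap_self_join(records, threshold=2)
```

Each record only touches the index lists of its own words, so pairs with no word in common are never compared. Predicates such as the Jaccard coefficient could be checked in this sketch by filtering on the shared-word count together with the two set sizes; that extension is an assumption of this example, not a description of the paper's method.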
