PAPER DIGEST
Most Influential WWW 2007 Paper · 2026-03 edition

Detecting Near-duplicates For Web Crawling

Gurmeet Singh Manku; Arvind Jain; Anish Das Sarma

Venue: ACM Web Conference (WWW) 2007
Recognition: Most Influential WWW 2007 Paper (Rank No. 11)
Edition: 2026-03
Impact factor: 7
Certificate ID: 0343b4c9fce056d7

Abstract

Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
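
To make the two ideas in the abstract concrete, here is a minimal Python sketch, not the authors' implementation. It builds a Charikar-style fingerprint by letting each token's hash vote on every bit position, then finds stored fingerprints within k bit-positions of a query using a plain linear scan. The function names, the use of SHA-256 as the token hash, unweighted tokens, and the defaults f = 64 and k = 3 are all assumptions made for illustration. The linear scan is only a baseline that states the problem; the paper's contribution is a faster technique for this lookup, which is not reproduced here.

```python
import hashlib


def simhash(tokens, f=64):
    """Charikar-style f-bit fingerprint (illustrative, unweighted tokens)."""
    counts = [0] * f
    mask = (1 << f) - 1
    for tok in tokens:
        # Assumption: SHA-256 truncated to f bits stands in for the token hash.
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask
        for i in range(f):
            # Each token votes +1 where its hash bit is 1, -1 where it is 0.
            counts[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(f):
        # The fingerprint bit is the sign of the accumulated vote.
        if counts[i] > 0:
            fp |= 1 << i
    return fp


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicates(query, fingerprints, k=3):
    """Baseline linear scan: stored fingerprints within k bits of the query."""
    return [fp for fp in fingerprints if hamming(query, fp) <= k]
```

A caller would fingerprint each crawled page's tokens with `simhash`, keep the results in a collection, and call `near_duplicates` on each new page's fingerprint. Pages that share most of their tokens tend to produce fingerprints that differ in few bits, which is the property that makes a small k useful.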
