PAPER DIGEST
Most Influential SIGMOD 1998 Paper · 2026-03 edition

Enhanced Hypertext Categorization Using Hyperlinks

Soumen Chakrabarti; Byron Dom; Piotr Indyk

Venue: ACM SIGMOD Conference (SIGMOD) 1998
Recognition: Most Influential SIGMOD 1998 Paper (Rank No. 4)
Edition: 2026-03
Impact factor: 8
Certificate ID: dfcc042352bfa2f3

Abstract

A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost upon a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo! and the US Patent Database. In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas our hypertext classifier reduced this to only 21%.
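
The abstract names relaxation labeling but does not spell out the update rule, so the following is only a minimal sketch of the general idea, not the authors' model. It assumes a hypothetical text-only classifier has already produced per-document class probabilities, that some neighbors have known topics, and that a made-up compatibility matrix `compat` describes how likely a document of one class is to link to a document of another. All names and numbers are illustrative assumptions.

```python
def relaxation_labeling(text_probs, links, known, compat, iterations=10):
    """Toy relaxation labeling sketch (an assumption, not the paper's exact model).

    text_probs: {doc: [p_class0, p_class1, ...]} from a text-only classifier
    links:      {doc: [neighbor docs]}
    known:      {doc: class index} for documents whose topic is already known
    compat:     compat[c][d] = assumed chance that a class-c doc links to a class-d doc
    """
    n_classes = len(compat)

    def one_hot(c):
        return [1.0 if i == c else 0.0 for i in range(n_classes)]

    # Start from the text-only estimate; documents with known topics are clamped.
    beliefs = {d: (one_hot(known[d]) if d in known else list(p))
               for d, p in text_probs.items()}

    for _ in range(iterations):
        updated = {}
        for doc, prior in text_probs.items():
            if doc in known:
                updated[doc] = beliefs[doc]
                continue
            scores = []
            for c in range(n_classes):
                # Text evidence times the evidence from each neighbor's current belief.
                score = prior[c]
                for nb in links.get(doc, []):
                    score *= sum(compat[c][d] * beliefs[nb][d]
                                 for d in range(n_classes))
                scores.append(score)
            total = sum(scores)
            updated[doc] = [s / total for s in scores] if total > 0 else list(prior)
        beliefs = updated
    return beliefs


# Hypothetical data: "c" has a known topic (class 0); "a" and "b" do not.
text_probs = {"a": [0.45, 0.55], "b": [0.5, 0.5], "c": [0.5, 0.5]}
links = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
known = {"c": 0}
compat = [[0.8, 0.2], [0.2, 0.8]]  # assumed: linked documents tend to share a class

result = relaxation_labeling(text_probs, links, known, compat)
```

In this toy setup the text evidence alone slightly favors class 1 for document "a", but the link to the known class-0 document "c" pulls both "a" and "b" toward class 0 over the iterations. Clamping known neighbors while re-estimating the rest is one simple way to reflect the abstract's point that the method adapts to the fraction of neighbors with known topics.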
