PAPER DIGEST
Most Influential CIKM 1999 Paper · 2026-03 edition

A Probabilistic Description-oriented Approach For Categorizing Web Documents

Norbert Gövert; Mounia Lalmas; Norbert Fuhr

Venue
ACM Conference on Information and Knowledge Management (CIKM) 1999
Recognition
Most Influential CIKM 1999 Paper (Rank No. 13)
Edition
2026-03
Impact factor
3
Certificate ID
2feb6e4b69721b48

Abstract

The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. We are facing a new challenge because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) using a representation of the content of web documents that captures these two characteristics and (2) using more effective classifiers. Our categorisation approach is based on a probabilistic description-oriented representation of web documents and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier. Experimental results show that (1) using an enhanced representation of web documents is crucial for effective categorisation, and (2) a theoretical interpretation of the k-nearest neighbour classifier yields an improvement over the standard k-nearest neighbour classifier.
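As a rough illustration of the general idea of reading k-nearest neighbour votes as probability-like scores, the sketch below weights each of the k nearest training documents by its cosine similarity to the query and normalises the weights so they sum to 1. This is not the authors' formulation: the function names (`cosine`, `knn_category_scores`), the sparse term-weight dictionaries, and the normalisation scheme are all assumptions chosen for brevity, and the description-oriented document representation is not modelled here.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_category_scores(query, training, k=3):
    """Similarity-weighted kNN vote (hypothetical helper, not the paper's method).

    `training` is a list of (term_weights, category) pairs. The returned
    dict maps each category seen among the k nearest neighbours to its
    share of the total similarity mass, so the values sum to 1.
    """
    neighbours = sorted(
        ((cosine(query, doc), cat) for doc, cat in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return {}  # no neighbour shares any term with the query
    scores = defaultdict(float)
    for sim, cat in neighbours:
        scores[cat] += sim / total
    return dict(scores)


# Toy data, invented purely for illustration.
training = [
    ({"web": 1.0, "html": 1.0}, "computing"),
    ({"web": 1.0, "link": 1.0}, "computing"),
    ({"goal": 1.0, "match": 1.0}, "sport"),
]
scores = knn_category_scores({"web": 1.0}, training, k=2)
```

With this toy data both nearest neighbours belong to the same category, so that category receives the full score. A simple majority vote would ignore how similar each neighbour is; weighting by similarity is one common way to give the parameters of the classifier a graded interpretation.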
