PAPER DIGEST
Most Influential CIKM 2004 Paper · 2026-03 edition

Stemming And Lemmatization In The Clustering Of Finnish Text Documents

Tuomo Korenius; Jorma Laurikkala; Kalervo Järvelin; Martti Juhola

Venue
ACM Conference on Information and Knowledge Management (CIKM) 2004
Recognition
Most Influential CIKM 2004 Paper (Rank No. 7)
Edition
2026-03
Impact factor
5
Certificate ID
1277ea60058eb85d

Abstract

Stemming and lemmatization were compared in the clustering of Finnish text documents. Since Finnish is a highly inflectional and agglutinative language, we hypothesized that lemmatization, which involves splitting compound words, would be a more appropriate normalization approach than straightforward stemming. The relevance of the documents was evaluated with a four-point relevance assessment scale, which was collapsed into a binary one in two ways: by considering all the relevant documents relevant, and by considering only the highly relevant documents relevant. Experiments with four hierarchical clustering methods supported the hypothesis. The stringent relevance scale showed that lemmatization allowed the single and complete linkage methods to recover especially the highly relevant documents better than stemming. In comparison with stemming, lemmatization together with the average linkage and Ward's methods produced higher precision. We conclude that lemmatization is a better word normalization method than stemming when Finnish text documents are clustered for information retrieval.
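
The two binary collapses of the four-point relevance scale described in the abstract can be illustrated with a small sketch. This is not the authors' code: the integer encoding of the grades (0 to 3), the function names, the "lenient" label, and the sample data are all assumptions made for illustration, and cluster precision is computed here simply as the fraction of documents in a cluster that count as relevant.

```python
# Illustrative sketch only. The 0..3 grade encoding, the names, and the
# sample data are assumptions, not taken from the paper.

def collapse_relevance(grade: int, stringent: bool) -> bool:
    """Collapse a four-point relevance grade (assumed 0..3) to a binary label.

    Lenient scale:   any relevant grade (>= 1) counts as relevant.
    Stringent scale: only the highest grade (3) counts as relevant.
    """
    if not 0 <= grade <= 3:
        raise ValueError("grade must be in 0..3")
    threshold = 3 if stringent else 1
    return grade >= threshold


def cluster_precision(cluster_doc_ids, grades, stringent: bool) -> float:
    """Fraction of documents in a cluster that are relevant under the chosen scale."""
    if not cluster_doc_ids:
        return 0.0
    relevant = sum(collapse_relevance(grades[d], stringent) for d in cluster_doc_ids)
    return relevant / len(cluster_doc_ids)


# Hypothetical graded judgments for four documents in one cluster.
grades = {"d1": 3, "d2": 1, "d3": 0, "d4": 2}
cluster = ["d1", "d2", "d3", "d4"]

lenient_precision = cluster_precision(cluster, grades, stringent=False)
stringent_precision = cluster_precision(cluster, grades, stringent=True)
```

With the sample data, three of the four documents count as relevant on the lenient scale and only one on the stringent scale, which shows why the same clustering can score quite differently under the two collapses.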
