Ranking Retrieval Systems Without Relevance Judgments

Ian Soboroff; Charles Nicholas; Patrick Cahan

Venue: ACM SIGIR Conference (SIGIR) 2001
Recognition: Most Influential SIGIR 2001 Paper (Rank No. 12)
Edition: 2026-03
Impact factor: 5
Certificate ID: 88913bcb77e5a7e5

Abstract

The most prevalent experimental methodology for comparing the effectiveness of information retrieval systems requires a test collection, composed of a set of documents, a set of query topics, and a set of relevance judgments indicating which documents are relevant to which topics. It is well known that relevance judgments are not infallible, but recent retrospective investigation into results from the Text REtrieval Conference (TREC) has shown that differences in human judgments of relevance do not affect the relative measured performance of retrieval systems. Based on this result, we propose and describe the initial results of a new evaluation methodology which replaces human relevance judgments with a randomly selected mapping of documents to topics which we refer to as pseudo-relevance judgments. Rankings of systems with our methodology correlate positively with official TREC rankings, although the performance of the top systems is not predicted well. The correlations are stable over a variety of pool depths and sampling techniques. With improvements, such a methodology could be useful in evaluating systems such as World-Wide Web search engines, where the set of documents changes too often to make traditional collection construction techniques practical.
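The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that pseudo-relevance judgments are drawn uniformly at random from a pool made of each system's top-ranked documents, that systems are scored by mean average precision, and that agreement between two system rankings is measured with Kendall's tau. The function names, the `depth` and `fraction` parameters, and the toy `runs` data are all hypothetical.

```python
import random
from itertools import combinations


def build_pool(runs, topic, depth):
    """Union of every system's top-`depth` documents for one topic."""
    pool = set()
    for run in runs.values():
        pool.update(run[topic][:depth])
    return sorted(pool)  # sorted list, so sampling is reproducible


def pseudo_qrels(runs, topics, depth, fraction, rng):
    """Randomly mark a fraction of each topic's pool as 'relevant'.

    `fraction` is an assumed parameter, not a value taken from the paper.
    """
    qrels = {}
    for topic in topics:
        pool = build_pool(runs, topic, depth)
        k = max(1, round(fraction * len(pool)))
        qrels[topic] = set(rng.sample(pool, k))
    return qrels


def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant docs."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0


def rank_systems(runs, qrels):
    """Order system names by mean average precision, best first."""
    scores = {
        name: sum(average_precision(run[t], qrels[t]) for t in qrels) / len(qrels)
        for name, run in runs.items()
    }
    return sorted(scores, key=lambda name: (-scores[name], name))


def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same items (no ties)."""
    pos_a = {s: i for i, s in enumerate(order_a)}
    pos_b = {s: i for i, s in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(order_a) * (len(order_a) - 1) / 2
    return (concordant - discordant) / pairs if pairs else 1.0


# Hypothetical toy runs: system name -> topic -> ranked document ids.
runs = {
    "sysA": {"t1": ["d1", "d2", "d3"], "t2": ["d7", "d8", "d9"]},
    "sysB": {"t1": ["d2", "d3", "d1"], "t2": ["d9", "d7", "d8"]},
}

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    qrels = pseudo_qrels(runs, ["t1", "t2"], depth=2, fraction=0.5, rng=rng)
    print(rank_systems(runs, qrels))
```

In a real experiment, the ranking produced from the pseudo-relevance judgments would be compared with the ranking produced from human judgments, for example by passing both orderings to `kendall_tau`.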
