PAPER DIGEST
Most Influential SIGIR 2000 Paper · 2026-03 edition

Evaluating Evaluation Measure Stability

Chris Buckley; Ellen M. Voorhees

Venue: ACM SIGIR Conference (SIGIR) 2000
Recognition: Most Influential SIGIR 2000 Paper (Rank No. 12)
Edition: 2026-03
Impact factor: 5
Certificate ID: d8b0116714932784

Abstract

This paper presents a novel way of examining the accuracy of the evaluation measures commonly used in information retrieval experiments. It validates several of the rules of thumb experimenters use, such as that a good experiment needs at least 25 queries and that 50 is better, while challenging other beliefs, such as that the common evaluation measures are equally reliable. As an example, we show that Precision at 30 documents has about twice the average error rate of Average Precision. These results can help information retrieval researchers design experiments that provide a desired level of confidence in their results. In particular, we suggest that researchers using Web measures such as Precision at 10 documents will need to use many more than 50 queries, or will have to require a very large difference in evaluation scores between two methods, before concluding that the two methods are actually different.
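
For readers unfamiliar with the measures named in the abstract, the following is a minimal sketch of Precision at k documents and (non-interpolated) Average Precision for a single query. It uses the standard textbook definitions and is not taken from the paper; the function names, document IDs, and relevance judgments are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Set


def precision_at_k(ranking: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    # Divide by k even if fewer than k documents were retrieved.
    return hits / k


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document
    appears, divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


# Hypothetical toy data: one ranked list and its relevance judgments.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}

p_at_2 = precision_at_k(ranking, relevant, 2)
ap = average_precision(ranking, relevant)
print(f"P@2 = {p_at_2:.3f}, AP = {ap:.3f}")
```

Precision at k looks only at a fixed cutoff, while Average Precision takes every relevant document's rank into account. That difference in how much of the ranking each measure uses is one plausible reason the two could differ in stability, though the abstract itself does not state a cause.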
