Extreme Value Theory Applied to Document Retrieval from Large Collections
(2006)

Abstract

While this article went to press, Yehuda Vardi passed away. We dedicate the paper to his memory.

We consider text retrieval applications that assign query-specific relevance scores to documents drawn from particular collections. Such applications represent a primary focus of the annual Text Retrieval Conference (TREC), where the participants compare the empirical performance of different approaches. P(K), the proportion of the top K documents that are relevant, is a popular measure of retrieval effectiveness. Participants in the TREC Very Large Corpus track have observed that when the target is a random sample from a collection, P(K) is substantially smaller than when the target is the entire collection. Hawking and Robertson (2003) confirmed this finding in a number of experimental settings. Hawking et al. (1999) posed the cause of this phenomenon as an open research question and proposed five possible explanatory hypotheses. In this paper, we present a mathematical analysis that sheds some light on these hypotheses and complements the experimental work of Hawking and Robertson (2003). We also introduce C(L), contamination at L, the number of irrelevant documents amongst the top L relevant documents, and describe its properties. Our analysis shows that while P(K) typically will increase with collection size, the phenomenon is not universal. That is, the asymptotic behavior of P(K) and C(L) depends on the score distributions and relative proportions of relevant and irrelevant documents in the collection.
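The two measures named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the toy scores, and the relevance labels are all hypothetical, and C(L) is read here as the number of irrelevant documents ranked above the L-th relevant document, which is one plausible interpretation of the abstract's wording.

```python
def rank_by_score(scores, labels):
    """Return relevance labels ordered by descending score (hypothetical helper)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def precision_at_k(ranked_labels, k):
    """P(K): proportion of the top k ranked documents that are relevant."""
    if k <= 0 or k > len(ranked_labels):
        raise ValueError("k must be between 1 and the number of documents")
    return sum(1 for rel in ranked_labels[:k] if rel) / k


def contamination_at_l(ranked_labels, l):
    """C(L): irrelevant documents ranked above the l-th relevant document.

    Assumption: this reading of 'contamination at L' is inferred from the
    abstract, not from the paper's formal definition.
    """
    if l <= 0:
        raise ValueError("l must be positive")
    relevant_seen = 0
    irrelevant_seen = 0
    for rel in ranked_labels:
        if rel:
            relevant_seen += 1
            if relevant_seen == l:
                return irrelevant_seen
        else:
            irrelevant_seen += 1
    raise ValueError("collection holds fewer than l relevant documents")


# Toy collection: six documents with made-up scores and relevance labels.
scores = [0.4, 0.9, 0.7, 0.5, 0.8, 0.6]
labels = [False, True, True, False, False, True]
ranked = rank_by_score(scores, labels)  # relevance in descending score order

p3 = precision_at_k(ranked, 3)      # two of the top three are relevant
c2 = contamination_at_l(ranked, 2)  # one irrelevant document precedes the 2nd relevant one
```

Applying the same two functions to a random subsample of the toy collection and to the whole collection is one simple way to explore the sample-versus-collection effect the abstract describes, although any pattern seen on toy data says nothing about the paper's asymptotic results.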
