
Extreme Value Theory Applied to Document Retrieval from Large Collections

D. Madigan, Y. Vardi, and I. Weissman.
(2006)

Abstract

While this article went to press, Yehuda Vardi passed away. We dedicate the paper to his memory. We consider text retrieval applications that assign query-specific relevance scores to documents drawn from particular collections. Such applications represent a primary focus of the annual Text Retrieval Conference (TREC), where the participants compare the empirical performance of different approaches. P(K), the proportion of the top K documents that are relevant, is a popular measure of retrieval effectiveness. Participants in the TREC Very Large Corpus track have observed that when the target is a random sample from a collection, P(K) is substantially smaller than when the target is the entire collection. Hawking and Robertson (2003) confirmed this finding in a number of experimental settings. Hawking et al. (1999) posed as an open research question the cause of this phenomenon and proposed five possible explanatory hypotheses. In this paper, we present a mathematical analysis that sheds some light on these hypotheses and complements the experimental work of Hawking and Robertson (2003). We also introduce C(L), contamination at L, the number of irrelevant documents amongst the top L relevant documents, and describe its properties. Our analysis shows that while P(K) typically will increase with collection size, the phenomenon is not universal. That is, the asymptotic behavior of P(K) and C(L) depends on the score distributions and relative proportions of relevant and irrelevant documents in the collection.
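
The two measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and the toy scores are hypothetical, and C(L) is read here, as an assumption, as the number of irrelevant documents ranked above the L-th relevant document.

```python
from typing import List, Tuple


def rank_by_score(docs: List[Tuple[float, bool]]) -> List[bool]:
    """Sort (score, is_relevant) pairs by descending score; return relevance flags."""
    return [rel for _, rel in sorted(docs, key=lambda d: d[0], reverse=True)]


def precision_at_k(ranked: List[bool], k: int) -> float:
    """P(K): proportion of the top K ranked documents that are relevant."""
    if k <= 0 or k > len(ranked):
        raise ValueError("k must be between 1 and the number of ranked documents")
    return sum(ranked[:k]) / k


def contamination_at_l(ranked: List[bool], l: int) -> int:
    """C(L), under the assumed reading: irrelevant documents ranked above
    the L-th relevant document."""
    if l <= 0:
        raise ValueError("l must be positive")
    relevant_seen = 0
    irrelevant_seen = 0
    for is_relevant in ranked:
        if is_relevant:
            relevant_seen += 1
            if relevant_seen == l:
                return irrelevant_seen
        else:
            irrelevant_seen += 1
    raise ValueError("fewer than l relevant documents in the ranking")


# Toy collection (made-up scores, for illustration only).
docs = [(0.9, True), (0.8, False), (0.7, True),
        (0.6, True), (0.4, False), (0.2, False)]
ranking = rank_by_score(docs)

print(precision_at_k(ranking, 3))      # two of the top three are relevant
print(contamination_at_l(ranking, 3))  # one irrelevant doc precedes the 3rd relevant one
```

Applying the same functions to a random subsample of `docs` and to the full list is one way to observe empirically how P(K) changes with collection size, which is the effect the paper analyzes.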

BibTeX key: Madigan_extremevalue
entry type: misc
year: 2006
url: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.78.1053


Cite this publication

@misc{Madigan_extremevalue,
  abstract = {While this article went to press, Yehuda Vardi passed away. We dedicate the paper to his memory. We consider text retrieval applications that assign query-specific relevance scores to documents drawn from particular collections. Such applications represent a primary focus of the annual Text Retrieval Conference (TREC), where the participants compare the empirical performance of different approaches. P(K), the proportion of the top K documents that are relevant, is a popular measure of retrieval effectiveness. Participants in the TREC Very Large Corpus track have observed that when the target is a random sample from a collection, P(K) is substantially smaller than when the target is the entire collection. Hawking and Robertson (2003) confirmed this finding in a number of experimental settings. Hawking et al. (1999) posed as an open research question the cause of this phenomenon and proposed five possible explanatory hypotheses. In this paper, we present a mathematical analysis that sheds some light on these hypotheses and complements the experimental work of Hawking and Robertson (2003). We also introduce C(L), contamination at L, the number of irrelevant documents amongst the top L relevant documents, and describe its properties. Our analysis shows that while P(K) typically will increase with collection size, the phenomenon is not universal. That is, the asymptotic behavior of P(K) and C(L) depends on the score distributions and relative proportions of relevant and irrelevant documents in the collection.},
  added-at = {2015-03-10T03:05:01.000+0100},
  author = {Madigan, David and Vardi, Yehuda and Weissman, Ishay},
  biburl = {https://www.bibsonomy.org/bibtex/2c41da1f3c0cc2ba60b3bef20bdc51ff8/estebancacavelo},
  description = {CiteSeerX — Extreme Value Theory Applied to Document Retrieval from Large Collections by},
  interhash = {f08b0833a1a5d53f88365c9825b18399},
  intrahash = {c41da1f3c0cc2ba60b3bef20bdc51ff8},
  keywords = {retrieval},
  timestamp = {2015-03-10T03:05:01.000+0100},
  title = {Extreme Value Theory Applied to Document Retrieval from Large Collections},
  url = {http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.78.1053},
  year = 2006
}
