Abstract
We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from the corpus of documents indexed by a search engine, using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines.

The technique of Bharat and Broder suffers from a well-recorded bias: it favors long documents. In this article we introduce two novel sampling algorithms: a lexicon-based algorithm and a random walk algorithm. Our algorithms produce biased samples, but each sample is accompanied by a weight, which represents its bias. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to four well-known Monte Carlo simulation methods: rejection sampling, importance sampling, the Metropolis–Hastings algorithm, and the Maximum Degree method.

The limited access to search engines forces our algorithms to use bias weights that are only "approximate". We characterize analytically the effect of approximate bias weights on Monte Carlo methods and conclude that our algorithms are guaranteed to produce near-uniform samples from the search engine's corpus. Our study of approximate Monte Carlo methods could be of independent interest.

Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long documents. We use our algorithms to collect comparative statistics about the corpora of the Google, MSN Search, and Yahoo! search engines.
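To illustrate the first of the four simulation methods mentioned above, here is a minimal sketch of how rejection sampling can turn weighted, biased samples into near-uniform ones. The sampler interface, the `weight_cap` parameter, and the toy two-document corpus are all hypothetical, invented for this illustration; they are not taken from the article.

```python
import random

def rejection_sample(biased_sampler, weight_cap, rng=random):
    """Simulate a near-uniform sample from a biased sampler.

    biased_sampler() is assumed to return a pair (doc, w), where the
    weight w is proportional to the probability of drawing doc relative
    to the uniform distribution. weight_cap is a lower bound on all
    weights (a hypothetical parameter for this sketch). A sample is
    accepted with probability weight_cap / w, which cancels the bias.
    """
    while True:
        doc, w = biased_sampler()
        if rng.random() < weight_cap / w:
            return doc

# Toy example: a two-document "corpus" where doc 'a' is twice as likely
# to be drawn as doc 'b', with weights reporting that bias.
rng = random.Random(0)

def toy_sampler():
    return ('a', 2.0) if rng.random() < 2 / 3 else ('b', 1.0)

counts = {'a': 0, 'b': 0}
for _ in range(20000):
    counts[rejection_sample(toy_sampler, 1.0, rng)] += 1
# After rejection, 'a' and 'b' should each appear close to half the time.
```

The article's analysis concerns the harder case where the weights are only approximate; this sketch assumes exact weights and only shows the basic mechanism.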