A study of using search engine page hits as a proxy for n-gram frequencies

Abstract

The idea of using the Web as a corpus for linguistic research is becoming increasingly popular. Most often this means using Web search engine page hit counts as estimates of n-gram frequencies. While the results so far have been very encouraging, some researchers worry about the apparent instability of these estimates. Using a particular NLP task, we compare the variability in the n-gram counts across different search engines, as well as for the same search engine across time, and find that although there are measurable differences, they are not statistically significant for the task examined.

BibTeX key: Nakov:Hearst:05b
entry type: inproceedings
booktitle: Proceedings of Recent Advances in Natural Language Processing 2005
year: 2005
Document: http://biotext.berkeley.edu/papers/nakov_ranlp2005.pdf
