Scaling to very very large corpora for natural language disambiguation
M. Banko and E. Brill. ACL '01: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 26--33. Morristown, NJ, USA, Association for Computational Linguistics, (2001)
DOI: 10.3115/1073012.1073017
Abstract
The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.
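To make the task concrete: confusion set disambiguation means choosing, from a set of commonly confused words (e.g. {principal, principle}), the member that belongs in a given context. The sketch below is a minimal toy illustration of this setup, not the learners evaluated in the paper; the confusion set, training sentences, and the simple smoothed context-count scorer are all invented for illustration. It does show why labeled data is "free" here: any well-edited sentence containing a confusion-set member is automatically a labeled example.

```python
from collections import Counter, defaultdict

# Toy confusion set; the paper uses standard sets such as {their, there}.
CONFUSION_SET = ("principal", "principle")

# "Labeled" data is free: each sentence containing a confusion-set member
# is a training example for that member. These sentences are invented.
training = [
    ("principal", "the school principal spoke at the assembly"),
    ("principal", "she met the principal in his office"),
    ("principle", "the principle of least privilege guides security design"),
    ("principle", "he refused on principle to sign the form"),
]

# Count the context words observed with each member of the confusion set.
context_counts = defaultdict(Counter)
for label, sentence in training:
    for word in sentence.split():
        if word != label:
            context_counts[label][word] += 1

def disambiguate(context_words):
    """Pick the confusion-set member whose training contexts best match
    the given context words (add-one smoothed counts)."""
    def score(label):
        counts = context_counts[label]
        return sum(counts[w] + 1 for w in context_words)
    return max(CONFUSION_SET, key=score)

print(disambiguate(["the", "school", "hired", "a", "new"]))  # principal
print(disambiguate(["of", "least", "resistance"]))           # principle
```

The paper's point is about the data axis rather than the model axis: with a scorer this simple, accuracy is driven almost entirely by how much text feeds the counts, which is what the authors observe when scaling training corpora by orders of magnitude.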
%0 Conference Paper
%1 Banko01
%A Banko, Michele
%A Brill, Eric
%B ACL '01: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics
%C Morristown, NJ, USA
%D 2001
%I Association for Computational Linguistics
%K LargeScale ambiguity
%P 26--33
%R 10.3115/1073012.1073017
%T Scaling to very very large corpora for natural language disambiguation
%U http://portal.acm.org/citation.cfm?id=1073017
%X The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.
@inproceedings{Banko01,
abstract = {The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.},
added-at = {2008-09-24T11:45:53.000+0200},
address = {Morristown, NJ, USA},
author = {Banko, Michele and Brill, Eric},
biburl = {https://www.bibsonomy.org/bibtex/27d0bd5b964c1fc33bd303c0ecb143d47/mkroell},
booktitle = {ACL '01: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics},
description = {Scaling to very very large corpora for natural language disambiguation},
doi = {10.3115/1073012.1073017},
interhash = {6b6b98539e848e6d0fb9b427be12dd9e},
intrahash = {7d0bd5b964c1fc33bd303c0ecb143d47},
keywords = {LargeScale ambiguity},
location = {Toulouse, France},
pages = {26--33},
publisher = {Association for Computational Linguistics},
timestamp = {2008-12-23T14:33:16.000+0100},
title = {Scaling to very very large corpora for natural language disambiguation},
url = {http://portal.acm.org/citation.cfm?id=1073017},
year = 2001
}