Scaling distributional similarity to large corpora
J. Gorman and J. Curran. ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pp. 361--368. Morristown, NJ, USA, Association for Computational Linguistics, (2006)
DOI: http://dx.doi.org/10.3115/1220175.1220221
Abstract
Accurately representing synonymy using distributional similarity requires large volumes of data to reliably represent infrequent words. However, the naïve nearest-neighbour approach to comparing context vectors extracted from large corpora scales poorly (O(n²) in the vocabulary size). In this paper, we compare several existing approaches to approximating the nearest-neighbour search for distributional similarity. We investigate the trade-off between efficiency and accuracy, and find that SASH (Houle and Sakuma, 2005) provides the best balance.
Description
Scaling distributional similarity to large corpora
%0 Conference Paper
%1 1220221
%A Gorman, James
%A Curran, James R.
%B ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
%C Morristown, NJ, USA
%D 2006
%I Association for Computational Linguistics
%K similarity
%P 361--368
%R http://dx.doi.org/10.3115/1220175.1220221
%T Scaling distributional similarity to large corpora
%U http://portal.acm.org/citation.cfm?id=1220175.1220221&coll=GUIDE&dl=ACM&CFID=14547294&CFTOKEN=67134592
%X Accurately representing synonymy using distributional similarity requires large volumes of data to reliably represent infrequent words. However, the naïve nearest-neighbour approach to comparing context vectors extracted from large corpora scales poorly (O(n²) in the vocabulary size). In this paper, we compare several existing approaches to approximating the nearest-neighbour search for distributional similarity. We investigate the trade-off between efficiency and accuracy, and find that SASH (Houle and Sakuma, 2005) provides the best balance.
@inproceedings{1220221,
abstract = {Accurately representing synonymy using distributional similarity requires large volumes of data to reliably represent infrequent words. However, the naïve nearest-neighbour approach to comparing context vectors extracted from large corpora scales poorly ($O(n^2)$ in the vocabulary size). In this paper, we compare several existing approaches to approximating the nearest-neighbour search for distributional similarity. We investigate the trade-off between efficiency and accuracy, and find that SASH (Houle and Sakuma, 2005) provides the best balance.},
added-at = {2008-12-09T23:36:11.000+0100},
address = {Morristown, NJ, USA},
author = {Gorman, James and Curran, James R.},
biburl = {https://www.bibsonomy.org/bibtex/2b3cac6faeb3e9f6363d263a0affc5808/jamesh},
booktitle = {ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics},
description = {Scaling distributional similarity to large corpora},
doi = {http://dx.doi.org/10.3115/1220175.1220221},
interhash = {a9bddd9d18cfbe2af544662837718ceb},
intrahash = {b3cac6faeb3e9f6363d263a0affc5808},
keywords = {similarity},
location = {Sydney, Australia},
pages = {361--368},
publisher = {Association for Computational Linguistics},
timestamp = {2008-12-09T23:36:11.000+0100},
title = {Scaling distributional similarity to large corpora},
url = {http://portal.acm.org/citation.cfm?id=1220175.1220221&coll=GUIDE&dl=ACM&CFID=14547294&CFTOKEN=67134592},
year = 2006
}