Smoothed Bloom Filter Language Models: Tera-Scale LMs on the Cheap
D. Talbot and M. Osborne. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 468--476. Prague, Czech Republic, Association for Computational Linguistics, (June 2007)
Abstract
A Bloom filter (BF) is a randomised data structure for set membership queries. Its space requirements fall significantly below lossless information-theoretic lower bounds but it produces false positives with some quantifiable probability. Here we present a general framework for deriving smoothed language model probabilities from BFs. We investigate how a BF containing n-gram statistics can be used as a direct replacement for a conventional n-gram model. Recent work has demonstrated that corpus statistics can be stored efficiently within a BF; here we consider how smoothed language model probabilities can be derived efficiently from this randomised representation. Our proposal takes advantage of the one-sided error guarantees of the BF and simple inequalities that hold between related n-gram statistics in order to further reduce the BF storage requirements and the error rate of the derived probabilities. We use these models as replacements for a conventional language model in machine translation experiments.
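The abstract hinges on the Bloom filter's one-sided error guarantee: items actually inserted are always reported present, while queries for absent items may yield false positives with some quantifiable probability. As an illustrative sketch only (not the authors' implementation; the class name, sizes, and double-hashing scheme are choices made here for brevity), a minimal Bloom filter over n-gram strings might look like this:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit array.

    One-sided error: inserted items are always found (no false
    negatives); absent items may be falsely reported present.
    """

    def __init__(self, m: int, k: int):
        self.m = m                        # number of bits
        self.k = k                        # number of hash positions
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter(m=10_000, k=5)
for ngram in ["the cat", "cat sat", "sat on"]:
    bf.add(ngram)

assert "the cat" in bf   # stored n-grams are always reported present
```

A query such as `"on the" in bf` may return either answer: `False` is certain to be correct, while `True` could be a false positive, which is exactly the asymmetry the paper's smoothing framework exploits.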
%0 Conference Paper
%1 talbot-osborne:2007:EMNLP-CoNLL2007
%A Talbot, David
%A Osborne, Miles
%B Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)
%C Prague, Czech Republic
%D 2007
%I Association for Computational Linguistics
%K Bloom LM compression filter lossy smoothing
%P 468--476
%T Smoothed Bloom Filter Language Models: Tera-Scale LMs on the Cheap
%U http://www.aclweb.org/anthology/D/D07/D07-1049
%X A Bloom filter (BF) is a randomised data structure for set membership queries. Its space requirements fall significantly below lossless information-theoretic lower bounds but it produces false positives with some quantifiable probability. Here we present a general framework for deriving smoothed language model probabilities from BFs. We investigate how a BF containing n-gram statistics can be used as a direct replacement for a conventional n-gram model. Recent work has demonstrated that corpus statistics can be stored efficiently within a BF; here we consider how smoothed language model probabilities can be derived efficiently from this randomised representation. Our proposal takes advantage of the one-sided error guarantees of the BF and simple inequalities that hold between related n-gram statistics in order to further reduce the BF storage requirements and the error rate of the derived probabilities. We use these models as replacements for a conventional language model in machine translation experiments.
@inproceedings{talbot-osborne:2007:EMNLP-CoNLL2007,
abstract = {A Bloom filter (BF) is a randomised data structure for set membership queries. Its space requirements fall significantly below lossless information-theoretic lower bounds but it produces false positives with some quantifiable probability. Here we present a general framework for deriving smoothed language model probabilities from BFs. We investigate how a BF containing n-gram statistics can be used as a direct replacement for a conventional n-gram model. Recent work has demonstrated that corpus statistics can be stored efficiently within a BF; here we consider how smoothed language model probabilities can be derived efficiently from this randomised representation. Our proposal takes advantage of the one-sided error guarantees of the BF and simple inequalities that hold between related n-gram statistics in order to further reduce the BF storage requirements and the error rate of the derived probabilities. We use these models as replacements for a conventional language model in machine translation experiments.},
added-at = {2008-11-26T10:37:53.000+0100},
address = {Prague, Czech Republic},
author = {Talbot, David and Osborne, Miles},
biburl = {https://www.bibsonomy.org/bibtex/22ed49b4ba976df063e30bc2d29066c73/jjv},
booktitle = {Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)},
interhash = {0989d43768e141073d423ead28f98a02},
intrahash = {2ed49b4ba976df063e30bc2d29066c73},
keywords = {Bloom LM compression filter lossy smoothing},
month = {June},
pages = {468--476},
publisher = {Association for Computational Linguistics},
timestamp = {2008-11-26T10:37:53.000+0100},
title = {Smoothed {B}loom Filter Language Models: Tera-Scale {LM}s on the Cheap},
url = {http://www.aclweb.org/anthology/D/D07/D07-1049},
year = 2007
}