Smoothed Bloom Filter Language Models: Tera-Scale LMs on the Cheap

David Talbot and Miles Osborne. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 468–476, Prague, Czech Republic. Association for Computational Linguistics, June 2007.

Abstract

A Bloom filter (BF) is a randomised data structure for set membership queries. Its space requirements fall significantly below lossless information-theoretic lower bounds, but it produces false positives with some quantifiable probability. Here we present a general framework for deriving smoothed language model probabilities from BFs. We investigate how a BF containing n-gram statistics can be used as a direct replacement for a conventional n-gram model. Recent work has demonstrated that corpus statistics can be stored efficiently within a BF; here we consider how smoothed language model probabilities can be derived efficiently from this randomised representation. Our proposal takes advantage of the one-sided error guarantees of the BF and simple inequalities that hold between related n-gram statistics in order to further reduce the BF storage requirements and the error rate of the derived probabilities. We use these models as replacements for a conventional language model in machine translation experiments.
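To make the abstract's ideas concrete, below is a minimal Python sketch of a Bloom filter and a probe-until-miss scheme for storing quantised n-gram counts. The class and function names, the SHA-256 hashing, and the encoding details are illustrative assumptions, not the paper's implementation; the comments on one-sided error and on the prefix-count inequality reflect the properties the abstract relies on.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array.

    Illustrative sketch only; the paper's actual parameters and
    hash family are not specified here.
    """

    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests
        # (any family of independent hash functions would do).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # One-sided error: never False for an inserted item;
        # True for an absent item only with some quantifiable
        # false-positive probability.
        return all((self.bits[p // 8] >> (p % 8)) & 1
                   for p in self._positions(item))


def store_count(bf, ngram, quantised_count):
    # Hypothetical count encoding: insert the n-gram paired with
    # every integer up to its (e.g. logarithmically) quantised count.
    for j in range(1, quantised_count + 1):
        bf.add((ngram, j))


def retrieve_count(bf, ngram, max_count=64):
    # Probe ascending counts until the first miss. One-sided error
    # means the result never falls below the stored value; false
    # positives can only inflate it. Inequalities between related
    # statistics (an n-gram is never more frequent than its prefix)
    # can be used to cap max_count and cut the error rate further.
    j = 0
    while j < max_count and (ngram, j + 1) in bf:
        j += 1
    return j


bf = BloomFilter(num_bits=1 << 20, num_hashes=3)
store_count(bf, ("language", "model"), quantised_count=5)
assert retrieve_count(bf, ("language", "model")) >= 5
```

Because errors are one-sided, a retrieved count is always an upper bound on the stored quantised count, which is what lets smoothed probabilities be derived from the randomised representation with controlled error.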
