Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval
S. Kuruppu, S. Puglisi, and J. Zobel. String Processing and Information Retrieval, page 201--206. Berlin, Heidelberg, Springer Berlin Heidelberg, (2010)
Abstract
Self-indexes -- data structures that simultaneously provide fast search of and access to compressed text -- are promising for genomic data but in their usual form are not able to exploit the high level of replication present in a collection of related genomes. Our `RLZ' approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of r sequences totaling N bases, with a total of s point mutations from a base sequence of length n, this representation requires just \$nH\_k(T) + s\backslashlog n + s\backslashlog \backslashfrac\N\\s\ + O(s)\$bits. At the cost of negligible extra space, access to ℓ consecutive symbols requires \$\backslashO(\backslashell + \backslashlog n)\$time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory.
%0 Conference Paper
%1 kuruppu2010relative
%A Kuruppu, Shanika
%A Puglisi, Simon J.
%A Zobel, Justin
%B String Processing and Information Retrieval
%C Berlin, Heidelberg
%D 2010
%E Chavez, Edgar
%E Lonardi, Stefano
%I Springer Berlin Heidelberg
%K algorithms data_structures genome_data
%P 201--206
%T Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval
%X Self-indexes -- data structures that simultaneously provide fast search of and access to compressed text -- are promising for genomic data but in their usual form are not able to exploit the high level of replication present in a collection of related genomes. Our `RLZ' approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of r sequences totaling N bases, with a total of s point mutations from a base sequence of length n, this representation requires just \$nH\_k(T) + s\backslashlog n + s\backslashlog \backslashfrac\N\\s\ + O(s)\$bits. At the cost of negligible extra space, access to ℓ consecutive symbols requires \$\backslashO(\backslashell + \backslashlog n)\$time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory.
%@ 978-3-642-16321-0
@inproceedings{kuruppu2010relative,
abstract = {Self-indexes -- data structures that simultaneously provide fast search of and access to compressed text -- are promising for genomic data but in their usual form are not able to exploit the high level of replication present in a collection of related genomes. Our `RLZ' approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of r sequences totaling N bases, with a total of s point mutations from a base sequence of length n, this representation requires just {\$}nH{\_}k(T) + s{\backslash}log n + s{\backslash}log {\backslash}frac{\{}N{\}}{\{}s{\}} + O(s){\$}bits. At the cost of negligible extra space, access to ℓ consecutive symbols requires {\$}{\backslash}O({\backslash}ell + {\backslash}log n){\$}time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory.},
added-at = {2020-06-29T22:01:44.000+0200},
address = {Berlin, Heidelberg},
author = {Kuruppu, Shanika and Puglisi, Simon J. and Zobel, Justin},
biburl = {https://www.bibsonomy.org/bibtex/2900c19605b96445d1d36e7a77ada4f56/peter.ralph},
booktitle = {String Processing and Information Retrieval},
editor = {Chavez, Edgar and Lonardi, Stefano},
interhash = {5995cda2406305a138687901c7cf273c},
intrahash = {900c19605b96445d1d36e7a77ada4f56},
isbn = {978-3-642-16321-0},
keywords = {algorithms data_structures genome_data},
pages = {201--206},
publisher = {Springer Berlin Heidelberg},
timestamp = {2020-06-29T22:01:44.000+0200},
title = {Relative {Lempel}-{Ziv} Compression of Genomes for Large-Scale Storage and Retrieval},
year = 2010
}