
GTC: how to maintain huge genotype collections in a compressed form

A. Danek and S. Deorowicz.
Bioinformatics, 34 (11): 1834-1840 (January 2018)
DOI: 10.1093/bioinformatics/bty023

Abstract

Nowadays, genome sequencing is frequently used in many research centers. In projects such as the Haplotype Reference Consortium or the Exome Aggregation Consortium, huge databases of genotypes in large populations are determined. As these collections grow, fast and memory-frugal ways of representing and searching them become crucial. We present GTC (GenoType Compressor), a novel compressed data structure for representation of huge collections of genetic variation data. It significantly outperforms existing solutions in terms of compression ratio and time of answering various types of queries. We show that the largest publicly available database, of about 60 000 haplotypes at about 40 million SNPs, can be stored in <4 GB, while queries related to variants are answered in a fraction of a second. GTC can be downloaded from https://github.com/refresh-bio/GTC or http://sun.aei.polsl.pl/REFRESH/gtc. Supplementary data are available at Bioinformatics online.

BibTeX key: danek2018maintain
entry type: article
year: 2018
month: 01
journal: Bioinformatics
number: 11
pages: 1834-1840
volume: 34
eprint: https://academic.oup.com/bioinformatics/article-pdf/34/11/1834/25121502/bty023.pdf
issn: 1367-4803
DOI: 10.1093/bioinformatics/bty023
url: https://doi.org/10.1093/bioinformatics/bty023
