Article

Modeling non-uniformity in short-read rates in RNA-Seq data.

Genome Biol, 11 (5): R50 (2010)

Having tried methods such as support vector machines and neural networks (Additional file 1), we settled on MART (multiple additive regression trees) as our final choice for a nonlinear model. Our results may benefit quantitative inference from RNA-Seq data. To reduce biases in gene expression estimates due to non-uniformity of read rates, we propose to estimate the expression of a single-isoform gene by the total number of reads along the gene divided by the sum of sequencing preferences (SSP) under our MART model.

What is the reason for the failure of our highly predictive model for sequencing preferences to lead to more significant improvements in gene expression estimates? We believe the answer is that when a gene is large, the dramatic local variations in the sequencing preferences will be smoothed out when they are summed over many positions to produce the SSP for the whole gene.

First, we downloaded from the UCSC genome browser website [30] the sequences of RefSeq genes [31,32] (mouse July 2007 mm9 for the Wold and Grimmond data, and human Feb 2009 hg19 for the Burge data). Then, we mapped the reads to all isoforms of the RefSeq genes. For Illumina data, we directly mapped the 25 or 32 nucleotide reads using SeqMap [33], allowing two mismatches. For ABI data, we used the same strategy as described in Supplementary Figure 1 of [12], where a three-round mapping for 35, 30 and 25 nucleotide qualified reads was performed separately. In each round, we used SOCS [34] as the mapping tool. After mapping, we selected genes that have only one isoform annotated in RefSeq and do not overlap with other genes, and called them 'non-overlapped single-isoform genes'. To avoid ambiguity, we retained only reads that map to a unique site, and only when that site lies within one of these non-overlapped single-isoform genes. Then, we counted the number of reads whose mapping starts at each position of these genes, which gives the count data.

The paper also explains its local Poisson model.
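The counting step and the SSP-based estimate described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (that is the R package 'mseq'): the function names, the 0-based coordinates, and the toy preference values are assumptions made for illustration, and in the paper the preferences would come from the fitted MART model rather than being typed in by hand.

```python
from collections import Counter


def start_counts(read_starts, gene_length):
    """Count the reads whose mapping starts at each position of one gene.

    `read_starts` holds 0-based start positions of uniquely mapped reads
    (an assumed input format). Starts outside the gene are ignored.
    """
    tally = Counter(read_starts)
    return [tally.get(pos, 0) for pos in range(gene_length)]


def expression_estimate(counts, preferences):
    """Total number of reads divided by the sum of sequencing preferences (SSP)."""
    if len(counts) != len(preferences):
        raise ValueError("counts and preferences must cover the same positions")
    ssp = sum(preferences)
    if ssp <= 0:
        raise ValueError("sum of sequencing preferences must be positive")
    return sum(counts) / ssp


# Toy single-isoform gene of length 5 with six uniquely mapped reads.
read_starts = [0, 0, 2, 3, 3, 3]
counts = start_counts(read_starts, 5)

# Hypothetical per-position sequencing preferences (placeholders, not model output).
preferences = [2.0, 0.5, 1.0, 2.0, 0.5]

estimate = expression_estimate(counts, preferences)
print(counts, estimate)
```

If every preference were equal to 1, the SSP would reduce to the gene length and the estimate to the usual reads-per-position figure, which is consistent with the excerpt's remark that local variation is smoothed out when summed over a long gene.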
Note: the methods section is short and supposedly clear; worth reading. Software available as the R package 'mseq'.
DOI: 10.1186/gb-2010-11-5-r50

Abstract

After mapping, RNA-Seq data can be summarized by a sequence of read counts commonly modeled as Poisson variables with constant rates along each transcript, which actually fit data poorly. We suggest using variable rates for different positions, and propose two models to predict these rates based on local sequences. These models explain more than 50% of the variations and can lead to improved estimates of gene and isoform expressions for both Illumina and Applied Biosystems data.
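To make the contrast between constant and variable Poisson rates concrete, the following Python sketch compares the log-likelihood of one toy count vector under both. Everything in it is an illustrative assumption: the counts and preferences are invented, and writing each variable rate as a gene-level scale times a per-position preference is one plausible parameterization, not a statement of the paper's exact model.

```python
import math


def poisson_loglik(counts, rates):
    """Log-likelihood of independent Poisson counts with the given positive rates."""
    return sum(
        n * math.log(r) - r - math.lgamma(n + 1)
        for n, r in zip(counts, rates)
    )


# Invented read-start counts along a short transcript, deliberately non-uniform.
counts = [0, 9, 1, 0, 10, 0]

# Constant-rate model: every position shares the mean count.
mean_rate = sum(counts) / len(counts)
constant_rates = [mean_rate] * len(counts)

# Variable-rate model (assumed form): gene-level scale times per-position preference.
preferences = [0.1, 2.5, 0.3, 0.1, 2.9, 0.1]  # hypothetical values
scale = sum(counts) / sum(preferences)
variable_rates = [scale * p for p in preferences]

ll_constant = poisson_loglik(counts, constant_rates)
ll_variable = poisson_loglik(counts, variable_rates)
print(ll_constant, ll_variable)
```

In this toy case the variable-rate model has the higher log-likelihood because its rates track the bursts at positions 1 and 4; with real data the preferences would have to be predicted from the local sequence, as the abstract describes.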
