Article,

Modeling non-uniformity in short-read rates in RNA-Seq data.

J. Li, H. Jiang, and W. Wong.
Genome Biol, 11 (5): R50 (2010)

Having tried methods such as support vector machines and neural networks (Additional file 1), we settled on MART (multiple additive regression trees) as our final choice for a nonlinear model. Our results may benefit quantitative inference from RNA-Seq data. To reduce biases in gene expression estimates due to non-uniformity of read rates, we propose to estimate the expression of a single-isoform gene by the total number of reads along the gene divided by the sum of sequencing preferences (SSP) under our MART model.

What is the reason for the failure of our highly predictive model for sequencing preferences to lead to more significant improvements in gene expression estimates? We believe the answer is that when a gene is large, the dramatic local variations in the sequencing preferences will be smoothed out when they are summed over many positions to produce the SSP for the whole gene.

First, we downloaded from the UCSC genome browser website [30] the sequences of RefSeq genes [31,32] (mouse July 2007 mm9 for the Wold and Grimmond data, and human Feb 2009 hg19 for the Burge data). Then, we mapped the reads to all isoforms of the RefSeq genes. For Illumina data, we directly mapped the 25 or 32 nucleotide reads using SeqMap [33], allowing two mismatches. For ABI data, we used the same strategy as described in Supplementary Figure 1 of [12], where a three-round mapping for 35, 30 and 25 nucleotide qualified reads was performed separately. In each round, we used SOCS [34] as the mapping tool. After mapping, we selected genes that have only one isoform annotated in RefSeq and do not overlap with other genes, and called them 'non-overlapped single-isoform genes'. To avoid ambiguity, we only retained reads that map to a unique site and this site is within the unique genes. Then, we counted the number of reads whose mapping starts at each position of these unique genes, which gives the count data.

Local Poisson model is explained. Short and supposedly clear methods part; read. Available at: R package 'mseq'.
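The two computations described in the note (counting read starts per position, then dividing the total read count by the SSP) can be sketched roughly as follows. This is a minimal illustration, not the code of the 'mseq' package: the function names, the 0-based coordinates, and the preference values are all hypothetical, and in the paper the preferences would come from the fitted MART model rather than being typed in by hand.

```python
from collections import Counter


def start_counts(read_starts, gene_length):
    """Count reads whose mapping starts at each 0-based position of a gene.

    `read_starts` is assumed to hold only uniquely mapped reads that fall
    inside the gene, as in the filtering step described in the note.
    """
    tally = Counter(read_starts)
    return [tally.get(pos, 0) for pos in range(gene_length)]


def expression_estimate(counts, preferences):
    """Total number of reads divided by the sum of sequencing preferences (SSP)."""
    if len(counts) != len(preferences):
        raise ValueError("counts and preferences must cover the same positions")
    ssp = sum(preferences)
    if ssp <= 0:
        raise ValueError("sum of sequencing preferences must be positive")
    return sum(counts) / ssp


# Hypothetical toy gene of length 5 with six uniquely mapped reads.
counts = start_counts([0, 0, 2, 3, 3, 3], gene_length=5)

# Constant-rate model: every position has preference 1.
uniform = expression_estimate(counts, [1.0] * 5)

# Position-specific preferences (made-up values standing in for model output).
weighted = expression_estimate(counts, [2.0, 0.5, 1.0, 3.0, 0.5])
```

With all preferences set to 1 the estimate reduces to reads per position, which is the constant-rate baseline the paper argues against.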
DOI: 10.1186/gb-2010-11-5-r50

Abstract

After mapping, RNA-Seq data can be summarized by a sequence of read counts commonly modeled as Poisson variables with constant rates along each transcript, which actually fit data poorly. We suggest using variable rates for different positions, and propose two models to predict these rates based on local sequences. These models explain more than 50% of the variations and can lead to improved estimates of gene and isoform expressions for both Illumina and Applied Biosystems data.
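The contrast between constant and variable rates can be written out as a short derivation. The notation here is an assumption made for illustration and is not taken from this page: $n_{ij}$ is the number of reads starting at position $j$ of gene $i$, $w_i$ is the expression level of gene $i$, and $\alpha_{ij}$ is the sequencing preference at that position.

```latex
\begin{align*}
  n_{ij} &\sim \mathrm{Poisson}(\mu_{ij}), \qquad \mu_{ij} = w_i\,\alpha_{ij}
    && \text{(variable rates; constant rates are the case } \alpha_{ij} = 1\text{)} \\
  \ell(w_i) &= \sum_j \bigl( n_{ij} \log(w_i \alpha_{ij}) - w_i \alpha_{ij} \bigr) + \text{const}
    && \text{(log-likelihood for gene } i\text{)} \\
  \frac{\partial \ell}{\partial w_i} &= \frac{\sum_j n_{ij}}{w_i} - \sum_j \alpha_{ij} = 0
  \;\Longrightarrow\;
  \hat{w}_i = \frac{\sum_j n_{ij}}{\sum_j \alpha_{ij}}
\end{align*}
```

Under these assumptions, and treating the preferences as known, the maximum-likelihood estimate is the total read count divided by the summed preferences, and it reduces to reads per position when every $\alpha_{ij}$ equals 1.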

BibTeX key: Li2010
entry type: article
year: 2010
institution: Department of Statistics, Stanford University, Sequoia Hall, 390 Serra Mall, Stanford, CA 94305, USA. junli07@stanford.edu
journal: Genome Biol
number: 5
pages: R50
volume: 11
pii: gb-2010-11-5-r50
medline-pst: ppublish
pmid: 20459815
file: Li2010.pdf:Li2010.pdf:PDF
language: eng
DOI: 10.1186/gb-2010-11-5-r50
url: http://dx.doi.org/10.1186/gb-2010-11-5-r50


Cite this publication

@article{Li2010,
  abstract = {After mapping, RNA-Seq data can be summarized by a sequence of read counts commonly modeled as Poisson variables with constant rates along each transcript, which actually fit data poorly. We suggest using variable rates for different positions, and propose two models to predict these rates based on local sequences. These models explain more than 50\% of the variations and can lead to improved estimates of gene and isoform expressions for both Illumina and Applied Biosystems data.},
  added-at = {2010-12-31T02:55:50.000+0100},
  author = {Li, Jun and Jiang, Hui and Wong, Wing Hung},
  biburl = {https://www.bibsonomy.org/bibtex/28e0658d6768c9e6cf80c2c9f501dd589/jabreftest},
  doi = {10.1186/gb-2010-11-5-r50},
  file = {Li2010.pdf:Li2010.pdf:PDF},
  institution = {Department of Statistics, Stanford University, Sequoia Hall, 390 Serra Mall, Stanford, CA 94305, USA. junli07@stanford.edu},
  interhash = {fbdc0b4cf1df8e3ef3c39f51fcc7b1cc},
  intrahash = {8e0658d6768c9e6cf80c2c9f501dd589},
  journal = {Genome Biol},
  keywords = {RNA Genetic;PoissonDistribution;ProteinIsoforms metabolism;Exons Nonparametric methods;Statistics genetics;GeneExpressionProfiling;GeneExpressionRegulation;Humans;LinearModels;Mice;Models NucleicAcid;Embryo Mammalian Animals;ApolipoproteinsE genetics;BaseSequence;Databases genetics/metabolism;RNA genetics;SequenceAnalysis},
  language = {eng},
  medline-pst = {ppublish},
  note = {Having tried methods such as support vector machines and neural networks (Additional file 1), we settled on MART (multiple additive regression trees) as our final choice for a nonlinear model. Our results may benefit quantitative inference from RNA-Seq data. To reduce biases in gene expression estimates due to non-uniformity of read rates, we propose to estimate the expression of a single-isoform gene by the total number of reads along the gene divided by the sum of sequencing preferences (SSP) under our MART model. What is the reason for the failure of our highly predictive model for sequencing preferences to lead to more significant improvements in gene expression estimates? We believe the answer is that when a gene is large, the dramatic local variations in the sequencing preferences will be smoothed out when they are summed over many positions to produce the SSP for the whole gene. First, we downloaded from the UCSC genome browser website [30] the sequences of RefSeq genes [31,32] (mouse July 2007 mm9 for the Wold and Grimmond data, and human Feb 2009 hg19 for the Burge data). Then, we mapped the reads to all isoforms of the RefSeq genes. For Illumina data, we directly mapped the 25 or 32 nucleotide reads using SeqMap [33], allowing two mismatches. For ABI data, we used the same strategy as described in Supplementary Figure 1 of [12], where a three-round mapping for 35, 30 and 25 nucleotide qualified reads was performed separately. In each round, we used SOCS [34] as the mapping tool. After mapping, we selected genes that have only one isoform annotated in RefSeq and do not overlap with other genes, and called them 'non-overlapped single-isoform genes'. To avoid ambiguity, we only retained reads that map to a unique site and this site is within the unique genes. Then, we counted the number of reads whose mapping starts at each position of these unique genes, which gives the count data. Local Poisson model is explained. Short and supposedly clear methods part; read. Available at: R package 'mseq'},
  number = 5,
  pages = {R50},
  pii = {gb-2010-11-5-r50},
  pmid = {20459815},
  timestamp = {2010-12-31T02:55:50.000+0100},
  title = {Modeling non-uniformity in short-read rates in RNA-Seq data.},
  url = {http://dx.doi.org/10.1186/gb-2010-11-5-r50},
  volume = 11,
  year = 2010
}
