J. Li, H. Jiang, and W. Wong. Genome Biol, 11 (5): R50 (2010)
Having tried methods such as support vector machines and neural networks (Additional file 1), we settled on MART (multiple additive regression trees) as our final choice for a nonlinear model. Our results may benefit quantitative inference from RNA-Seq data. To reduce biases in gene expression estimates due to non-uniformity of read rates, we propose to estimate the expression of a single-isoform gene by the total number of reads along the gene divided by the sum of sequencing preferences (SSP) under our MART model.
What is the reason for the failure of our highly predictive model for sequencing preferences to lead to more significant improvements in gene expression estimates? We believe the answer is that when a gene is large, the dramatic local variations in the sequencing preferences will be smoothed out when they are summed over many positions to produce the SSP for the whole gene.
First, we downloaded from the UCSC genome browser website [30] the sequences of RefSeq genes [31,32] (mouse July 2007 mm9 for the Wold and Grimmond data, and human Feb 2009 hg19 for the Burge data). Then, we mapped the reads to all isoforms of the RefSeq genes. For Illumina data, we directly mapped the 25 or 32 nucleotide reads using SeqMap [33], allowing two mismatches. For ABI data, we used the same strategy as described in Supplementary Figure 1 of [12], where a three-round mapping for 35, 30 and 25 nucleotide qualified reads was performed separately. In each round, we used SOCS [34] as the mapping tool. After mapping, we selected genes that have only one isoform annotated in RefSeq and do not overlap with other genes, and called them 'non-overlapped single-isoform genes'. To avoid ambiguity, we only retained reads that map to a unique site that lies within these unique genes. Then, we counted the number of reads whose mapping starts at each position of these unique genes, which gives the count data.
The local Poisson model is explained.
Short and supposedly clear methods section; read. Available as the R package 'mseq'.
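The counting and normalisation steps quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the 'mseq' implementation: the function names, the `alignments` layout (read id mapped to a list of `(gene, start)` hits) and the per-position `preferences` values are all hypothetical, and in the paper the preferences would come from the fitted MART model.

```python
def unique_read_starts(alignments, gene):
    """Start positions of reads that map to exactly one site, inside `gene`.

    `alignments` is a hypothetical dict: read id -> list of (gene, start) hits.
    """
    starts = []
    for hits in alignments.values():
        if len(hits) == 1 and hits[0][0] == gene:
            starts.append(hits[0][1])
    return starts


def count_read_starts(read_starts, gene_length):
    """Number of reads whose mapping starts at each position of the gene."""
    counts = [0] * gene_length
    for pos in read_starts:
        if not 0 <= pos < gene_length:
            raise ValueError(f"start position {pos} lies outside the gene")
        counts[pos] += 1
    return counts


def ssp_expression_estimate(counts, preferences):
    """Total reads along the gene divided by the sum of sequencing preferences.

    `preferences` holds one assumed, model-predicted preference per position.
    With all preferences equal to 1 this reduces to reads per position.
    """
    if len(counts) != len(preferences):
        raise ValueError("counts and preferences must have the same length")
    ssp = sum(preferences)
    if ssp <= 0:
        raise ValueError("sum of sequencing preferences must be positive")
    return sum(counts) / ssp


# Toy data: r3 maps to two sites and r5 maps to another gene, so both are dropped.
alignments = {
    "r1": [("g", 0)],
    "r2": [("g", 0)],
    "r3": [("g", 2), ("h", 7)],
    "r4": [("g", 2)],
    "r5": [("h", 1)],
}
counts = count_read_starts(unique_read_starts(alignments, "g"), gene_length=5)
uniform_estimate = ssp_expression_estimate(counts, [1.0] * 5)
weighted_estimate = ssp_expression_estimate(counts, [2.0, 0.5, 1.0, 0.25, 0.25])
```

Because the estimate depends on the preferences only through their sum, strong local variation in long genes largely averages out, which is the explanation the excerpt gives for the modest improvement.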
H. Jiang and W. Wong. Bioinformatics, 25 (8): 1026--1032 (April 2009)
Assumes a Poisson distribution of reads; the non-uniformity of the read distribution is discussed later on. Splicing events more complex than exon skipping also need to be evaluated, they say.
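To make the Poisson assumption in the note above concrete, here is a minimal sketch for the simplest case of a single-isoform gene. It assumes, for illustration only, that the read count in each region is Poisson with mean `total_reads * length * theta`, where `theta` is the expression level; the function names and this exact parameterisation are assumptions of the sketch and are not taken from the paper.

```python
import math


def poisson_log_likelihood(theta, counts, lengths, total_reads):
    """Log-likelihood of region read counts under the assumed Poisson model."""
    ll = 0.0
    for n, length in zip(counts, lengths):
        lam = total_reads * length * theta  # assumed Poisson mean for this region
        ll += n * math.log(lam) - lam - math.lgamma(n + 1)
    return ll


def mle_single_isoform(counts, lengths, total_reads):
    """Closed-form maximiser of the log-likelihood above.

    Setting the derivative with respect to theta to zero gives
    theta = sum(counts) / (total_reads * sum(lengths)).
    """
    return sum(counts) / (total_reads * sum(lengths))


# Toy example: three exons of one gene in a library of one million reads.
counts = [30, 50, 20]
lengths = [300, 500, 200]
total_reads = 1_000_000
theta_hat = mle_single_isoform(counts, lengths, total_reads)
```

Under this uniform-rate assumption the estimate is simply reads per base per sequenced read; the non-uniformity mentioned in the note is exactly what such a model ignores.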
B. Howard and S. Heber. BMC Bioinformatics, (2010)
SeqMap is used as the alignment tool, which might be slow. Arabidopsis is used. Takes into account the non-uniformity of RNA-Seq read positions along the targeted transcripts. Assumes that the set of splice variants is known; the goal is to estimate the relative expression levels of these isoforms in a mixture.
Implementation: The algorithm described above was implemented in Java, with matrix computations by the JAMA matrix library (available at http://math.nist.gov/javanumerics/jama/). Data analyses and simulations were also performed using the R statistical programming language (http://www.r-project.org/).
Real RNA-Seq datasets: For each dataset, reads were mapped to the transcriptome using the SOAP v2 alignment program [15]. TAIR 8 was used to define the tested gene models.
Differential splicing: We first used a chi-square test of subset counts to identify genes that were differentially spliced between the two conditions.
“Multireads,” or reads that map to more than one gene, are another important problem for accurate isoform quantification.
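The chi-square test of subset counts mentioned in the excerpt can be illustrated with a plain Pearson chi-square test of independence. This is a sketch under assumptions: the table layout (one row per condition, one column per read subset of a gene) and the function name are hypothetical, and the paper may construct its subsets and test differently.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a count table.

    `table` is a list of rows: one row per condition, one column per read
    subset of a gene (an assumed layout for this illustration).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected == 0:
                raise ValueError("a row or column of the table is all zeros")
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Toy gene: the two subsets swap their share of reads between conditions.
shifted = chi_square_statistic([[10, 20], [20, 10]])
# Same proportions in both conditions, so there is no evidence of a change.
unchanged = chi_square_statistic([[10, 20], [10, 20]])
```

A p-value follows from the chi-square distribution with `dof` degrees of freedom, for example via `scipy.stats.chi2.sf(statistic, dof)` if SciPy is available. When many genes are tested, a multiple-testing correction would normally be applied afterwards.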