Article,

A comparison of per sample global scaling and per gene normalization methods for differential expression analysis of RNA-seq data

X. Li, G. Brock, E. Rouchka, N. Cooper, D. Wu, T. O'Toole, R. Gill, A. Eteleeb, L. O'Brien, and S. Rai.
PLoS One, (2017)
DOI: 10.1371/journal.pone.0176185

Abstract

Normalization is an essential step with considerable impact on high-throughput RNA sequencing (RNA-seq) data analysis. Although there are numerous methods for read count normalization, it remains a challenge to choose an optimal method due to multiple factors contributing to read count variability that affects the overall sensitivity and specificity. In order to properly determine the most appropriate normalization methods, it is critical to compare the performance and shortcomings of a representative set of normalization routines based on different dataset characteristics. Therefore, we set out to evaluate the performance of the commonly used methods (DESeq, TMM-edgeR, FPKM-CuffDiff, TC, Med, UQ and FQ) and two new methods we propose: Med-pgQ2 and UQ-pgQ2 (per-gene normalization after per-sample median or upper-quartile global scaling). Our per-gene normalization approach allows for comparisons between conditions based on similar count levels. Using the benchmark Microarray Quality Control Project (MAQC) and simulated datasets, we performed differential gene expression analysis to evaluate these methods. When evaluating MAQC2 with two replicates, we observed that Med-pgQ2 and UQ-pgQ2 achieved a slightly higher area under the Receiver Operating Characteristic Curve (AUC), a specificity rate > 85%, the detection power > 92% and an actual false discovery rate (FDR) under 0.06 given the nominal FDR (≤0.05). Although the top commonly used methods (DESeq and TMM-edgeR) yield a higher power (>93%) for MAQC2 data, they trade off with a reduced specificity (<70%) and a slightly higher actual FDR than our proposed methods. In addition, the results from an analysis based on the qualitative characteristics of sample distribution for MAQC2 and human breast cancer datasets show that only our gene-wise normalization methods corrected data skewed towards lower read counts. However, when we evaluated MAQC3 with less variation in five replicates, all methods performed similarly. Thus, our proposed Med-pgQ2 and UQ-pgQ2 methods perform slightly better for differential gene analysis of RNA-seq data skewed towards lowly expressed read counts with high variation by improving specificity while maintaining a good detection power with a control of the nominal FDR level.
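
The two-stage idea named in the abstract (per-sample median or upper-quartile global scaling, then per-gene normalization) can be illustrated with a minimal sketch. The abstract does not spell out the formulas, so several details here are assumptions: the function names `percentile`, `global_scale` and `per_gene_scale` are hypothetical, the per-gene step is taken to divide each gene by its median (Q2) across samples, and the rescaling constants (mean of the sample factors, and 100) are illustrative choices rather than the authors' definitions.

```python
def percentile(values, q):
    """q-quantile (0 <= q <= 1) with linear interpolation between ranks."""
    xs = sorted(values)
    if not xs:
        raise ValueError("no values to summarize")
    pos = (len(xs) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] * (1 - frac) + xs[hi] * frac


def global_scale(counts, q):
    """Per-sample global scaling of a genes x samples count matrix.

    q=0.5 gives median scaling, q=0.75 upper-quartile scaling.
    Assumption: quantiles are taken over non-zero counts only, and the
    result is multiplied by the mean factor to stay on a count-like scale.
    """
    n_samples = len(counts[0])
    factors = []
    for j in range(n_samples):
        nonzero = [row[j] for row in counts if row[j] > 0]
        factors.append(percentile(nonzero, q))
    mean_factor = sum(factors) / len(factors)
    return [[row[j] / factors[j] * mean_factor for j in range(n_samples)]
            for row in counts]


def per_gene_scale(scaled, constant=100.0):
    """Per-gene step (assumed): divide each gene by its median across samples."""
    out = []
    for row in scaled:
        gene_median = percentile(row, 0.5)
        if gene_median == 0:
            out.append(list(row))  # leave genes with a zero median untouched
        else:
            out.append([v / gene_median * constant for v in row])
    return out


if __name__ == "__main__":
    # Toy matrix: 4 genes x 2 samples; sample 2 was sequenced twice as deeply.
    toy = [[10, 20], [20, 40], [30, 60], [40, 80]]
    uq_scaled = global_scale(toy, 0.75)
    print(per_gene_scale(uq_scaled))
```

In the toy matrix the second sample differs from the first only by sequencing depth, so the global scaling step removes the difference and the per-gene step then places every gene on the same level, which is the sense in which per-gene normalization lets conditions be compared at similar count levels.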

BibTeX key: Li:2017:PLoS-One:28459823
entry type: article
year: 2017
journal: PLoS One
number: 5
volume: 12
pmid: 28459823
DOI: 10.1371/journal.pone.0176185
url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411036/


Cite this publication

%0 Journal Article
%1 Li:2017:PLoS-One:28459823
%A Li, X
%A Brock, G N
%A Rouchka, E C
%A Cooper, N G F
%A Wu, D
%A O'Toole, T E
%A Gill, R S
%A Eteleeb, A M
%A O'Brien, L
%A Rai, S N
%D 2017
%J PLoS One
%K MUSTREAD deseq edgeR fpkm fulltext methods normalization rna-seq rpkm
%N 5
%R 10.1371/journal.pone.0176185
%T A comparison of per sample global scaling and per gene normalization methods for differential expression analysis of RNA-seq data
%U https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411036/
%V 12
%X Normalization is an essential step with considerable impact on high-throughput RNA sequencing (RNA-seq) data analysis. Although there are numerous methods for read count normalization, it remains a challenge to choose an optimal method due to multiple factors contributing to read count variability that affects the overall sensitivity and specificity. In order to properly determine the most appropriate normalization methods, it is critical to compare the performance and shortcomings of a representative set of normalization routines based on different dataset characteristics. Therefore, we set out to evaluate the performance of the commonly used methods (DESeq, TMM-edgeR, FPKM-CuffDiff, TC, Med, UQ and FQ) and two new methods we propose: Med-pgQ2 and UQ-pgQ2 (per-gene normalization after per-sample median or upper-quartile global scaling). Our per-gene normalization approach allows for comparisons between conditions based on similar count levels. Using the benchmark Microarray Quality Control Project (MAQC) and simulated datasets, we performed differential gene expression analysis to evaluate these methods. When evaluating MAQC2 with two replicates, we observed that Med-pgQ2 and UQ-pgQ2 achieved a slightly higher area under the Receiver Operating Characteristic Curve (AUC), a specificity rate > 85%, the detection power > 92% and an actual false discovery rate (FDR) under 0.06 given the nominal FDR (≤0.05). Although the top commonly used methods (DESeq and TMM-edgeR) yield a higher power (>93%) for MAQC2 data, they trade off with a reduced specificity (<70%) and a slightly higher actual FDR than our proposed methods. In addition, the results from an analysis based on the qualitative characteristics of sample distribution for MAQC2 and human breast cancer datasets show that only our gene-wise normalization methods corrected data skewed towards lower read counts. However, when we evaluated MAQC3 with less variation in five replicates, all methods performed similarly. Thus, our proposed Med-pgQ2 and UQ-pgQ2 methods perform slightly better for differential gene analysis of RNA-seq data skewed towards lowly expressed read counts with high variation by improving specificity while maintaining a good detection power with a control of the nominal FDR level.

@article{Li:2017:PLoS-One:28459823,
  abstract = {Normalization is an essential step with considerable impact on high-throughput RNA sequencing (RNA-seq) data analysis. Although there are numerous methods for read count normalization, it remains a challenge to choose an optimal method due to multiple factors contributing to read count variability that affects the overall sensitivity and specificity. In order to properly determine the most appropriate normalization methods, it is critical to compare the performance and shortcomings of a representative set of normalization routines based on different dataset characteristics. Therefore, we set out to evaluate the performance of the commonly used methods (DESeq, TMM-edgeR, FPKM-CuffDiff, TC, Med, UQ and FQ) and two new methods we propose: Med-pgQ2 and UQ-pgQ2 (per-gene normalization after per-sample median or upper-quartile global scaling). Our per-gene normalization approach allows for comparisons between conditions based on similar count levels. Using the benchmark Microarray Quality Control Project (MAQC) and simulated datasets, we performed differential gene expression analysis to evaluate these methods. When evaluating MAQC2 with two replicates, we observed that Med-pgQ2 and UQ-pgQ2 achieved a slightly higher area under the Receiver Operating Characteristic Curve (AUC), a specificity rate > 85%, the detection power > 92% and an actual false discovery rate (FDR) under 0.06 given the nominal FDR (≤0.05). Although the top commonly used methods (DESeq and TMM-edgeR) yield a higher power (>93%) for MAQC2 data, they trade off with a reduced specificity (<70%) and a slightly higher actual FDR than our proposed methods. In addition, the results from an analysis based on the qualitative characteristics of sample distribution for MAQC2 and human breast cancer datasets show that only our gene-wise normalization methods corrected data skewed towards lower read counts. However, when we evaluated MAQC3 with less variation in five replicates, all methods performed similarly. Thus, our proposed Med-pgQ2 and UQ-pgQ2 methods perform slightly better for differential gene analysis of RNA-seq data skewed towards lowly expressed read counts with high variation by improving specificity while maintaining a good detection power with a control of the nominal FDR level.},
  added-at = {2018-10-10T08:43:15.000+0200},
  author = {Li, X and Brock, G N and Rouchka, E C and Cooper, N G F and Wu, D and O'Toole, T E and Gill, R S and Eteleeb, A M and O'Brien, L and Rai, S N},
  biburl = {https://www.bibsonomy.org/bibtex/2ee667ee339404a725d37264d8ad570ff/marcsaric},
  description = {A comparison of per sample global scaling and per gene normalization methods for differential expression analysis of RNA-seq data},
  doi = {10.1371/journal.pone.0176185},
  interhash = {a54d2d4517ec8df5334250dacc2e4287},
  intrahash = {ee667ee339404a725d37264d8ad570ff},
  journal = {PLoS One},
  keywords = {MUSTREAD deseq edgeR fpkm fulltext methods normalization rna-seq rpkm},
  number = 5,
  pmid = {28459823},
  timestamp = {2018-10-10T08:43:15.000+0200},
  title = {A comparison of per sample global scaling and per gene normalization methods for differential expression analysis of RNA-seq data},
  url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411036/},
  volume = 12,
  year = 2017
}
