Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?

W. Touw, J. Bayjanov, L. Overmars, L. Backus, J. Boekhorst, M. Wels, and S. van Hijum. Briefings in Bioinformatics (July 2012)
DOI: 10.1093/bib/bbs034

Abstract

In the Life Sciences ‘omics’ data is increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled, i.e. sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques allow training a model based on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g. healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited for the analysis of these large data sets. In the Life Sciences, RF is popular because RF classification models have a high-prediction accuracy and provide information on importance of variables for classification. For omics data, variables or conditional relations between variables are typically important for a subset of samples of the same class. For example: within a class of cancer patients certain SNP combinations may be important for a subset of patients that have a specific subtype of cancer, but not important for a different subset of patients. These conditional relationships can in principle be uncovered from the data with RF as these are implicitly taken into account by the algorithm during the creation of the classification model. This review details some of the to the best of our knowledge rarely or never used RF properties that allow maximizing the biological insights that can be extracted from complex omics data sets using RF.

Links and resources

BibTeX key: touw_data_2012
entry type: article
year: 2012
month: jul
journal: Briefings in Bioinformatics
issn: 1467-5463, 1477-4054
shorttitle: Data mining in the Life Sciences with Random Forest
language: en
DOI: 10.1093/bib/bbs034
urldate: 2012-07-16
url: http://bib.oxfordjournals.org/content/early/2012/07/10/bib.bbs034

Cite this publication

%0 Journal Article
%1 touw_data_2012
%A Touw, Wouter G.
%A Bayjanov, Jumamurat R.
%A Overmars, Lex
%A Backus, Lennart
%A Boekhorst, Jos
%A Wels, Michiel
%A van Hijum, Sacha A. F. T.
%D 2012
%J Briefings in Bioinformatics
%K Forest, Learning, Machine Random algorithms, conditional importance, interaction model relationships, selection, variable
%R 10.1093/bib/bbs034
%T Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?
%U http://bib.oxfordjournals.org/content/early/2012/07/10/bib.bbs034
%X In the Life Sciences ‘omics’ data is increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled, i.e. sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques allow training a model based on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g. healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited for the analysis of these large data sets. In the Life Sciences, RF is popular because RF classification models have a high-prediction accuracy and provide information on importance of variables for classification. For omics data, variables or conditional relations between variables are typically important for a subset of samples of the same class. For example: within a class of cancer patients certain SNP combinations may be important for a subset of patients that have a specific subtype of cancer, but not important for a different subset of patients. These conditional relationships can in principle be uncovered from the data with RF as these are implicitly taken into account by the algorithm during the creation of the classification model. This review details some of the to the best of our knowledge rarely or never used RF properties that allow maximizing the biological insights that can be extracted from complex omics data sets using RF.

@article{touw_data_2012,
  abstract = {In the Life Sciences ‘omics’ data is increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled, i.e. sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques allow training a model based on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g. healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited for the analysis of these large data sets. In the Life Sciences, RF is popular because RF classification models have a high-prediction accuracy and provide information on importance of variables for classification. For omics data, variables or conditional relations between variables are typically important for a subset of samples of the same class. For example: within a class of cancer patients certain SNP combinations may be important for a subset of patients that have a specific subtype of cancer, but not important for a different subset of patients. These conditional relationships can in principle be uncovered from the data with RF as these are implicitly taken into account by the algorithm during the creation of the classification model. This review details some of the to the best of our knowledge rarely or never used RF properties that allow maximizing the biological insights that can be extracted from complex omics data sets using RF.},
  added-at = {2017-01-09T13:57:26.000+0100},
  author = {Touw, Wouter G. and Bayjanov, Jumamurat R. and Overmars, Lex and Backus, Lennart and Boekhorst, Jos and Wels, Michiel and van Hijum, Sacha A. F. T.},
  biburl = {https://www.bibsonomy.org/bibtex/2af4bc8046339b654241b2577b156364b/yourwelcome},
  doi = {10.1093/bib/bbs034},
  interhash = {59bc2bd4d3f72f176de99ee43b7c2307},
  intrahash = {af4bc8046339b654241b2577b156364b},
  issn = {1467-5463, 1477-4054},
  journal = {Briefings in Bioinformatics},
  keywords = {Forest, Learning, Machine Random algorithms, conditional importance, interaction model relationships, selection, variable},
  language = {en},
  month = jul,
  shorttitle = {Data mining in the {Life} {Sciences} with {Random} {Forest}},
  timestamp = {2017-01-09T14:01:11.000+0100},
  title = {Data mining in the {Life} {Sciences} with {Random} {Forest}: a walk in the park or lost in the jungle?},
  url = {http://bib.oxfordjournals.org/content/early/2012/07/10/bib.bbs034},
  urldate = {2012-07-16},
  year = 2012
}
