| Authors: |
Chin-Sheng Yu
and Yu-Ching Chen
and Chih-Hao Lu
and Jenn-Kang Hwang
|
| URL: |
http://dx.doi.org/10.1002/prot.21018 |
| Tags: |
SubCellLoc
|
| Abstract: |
Because the protein's function is usually related to its subcellular
localization, the ability to predict subcellular localization directly
from protein sequences will be useful for inferring protein functions.
Recent years have seen a surging interest in the development of novel
computational tools to predict subcellular localization. At present,
these approaches, based on a wide range of algorithms, have achieved
varying degrees of success for specific organisms and for certain
localization categories. A number of authors have noticed that sequence
similarity is useful in predicting subcellular localization. For
example, Nair and Rost (Protein Sci 2002;11:2836-2847) have carried
out extensive analysis of the relation between sequence similarity
and identity in subcellular localization, and have found a close
relationship between them above a certain similarity threshold. However,
many existing benchmark data sets used for the prediction accuracy
assessment contain highly homologous sequences, some data sets comprising
sequences up to 80-90\% sequence identity. Using these benchmark
test data will surely lead to overestimation of the performance of
the methods considered. Here, we develop an approach based on a two-level
support vector machine (SVM) system: the first level comprises a
number of SVM classifiers, each based on a specific type of feature
vectors derived from sequences; the second level SVM classifier functions
as the jury machine to generate the probability distribution of decisions
for possible localizations. We compare our approach with a global
sequence alignment approach and other existing approaches for two
benchmark data sets, one comprising prokaryotic sequences and the
other eukaryotic sequences. Furthermore, we carried out all-against-all
sequence alignment for several data sets to investigate the relationship
between sequence homology and subcellular localization. Our results,
which are consistent with previous studies, indicate that the homology
search approach performs well down to 30\% sequence identity, although
its performance deteriorates considerably for sequences sharing lower
sequence identity. A data set of high homology levels will undoubtedly
lead to biased assessment of the performances of the predictive approaches, especially
those relying on homology search or sequence annotations. Our two-level
classification system based on SVM does not rely on homology search;
therefore, its performance remains relatively unaffected by sequence
homology. When compared with other approaches, our approach performed
significantly better. Furthermore, we also develop a practical hybrid
method, which combines the two-level SVM classifier and the homology
search method, as a general tool for the sequence annotation of subcellular
localization. |
@article{Yu2006,
title = {Prediction of protein subcellular localization},
author = {Chin-Sheng Yu and Yu-Ching Chen and Chih-Hao Lu and Jenn-Kang Hwang},
journal = {Proteins: Structure, Function, and Bioinformatics},
month = {August},
number = {3},
pages = {643--651},
url = {http://dx.doi.org/10.1002/prot.21018},
volume = {64},
year = {2006},
abstract = {Because the protein's function is usually related to its subcellular
localization, the ability to predict subcellular localization directly
from protein sequences will be useful for inferring protein functions.
Recent years have seen a surging interest in the development of novel
computational tools to predict subcellular localization. At present,
these approaches, based on a wide range of algorithms, have achieved
varying degrees of success for specific organisms and for certain
localization categories. A number of authors have noticed that sequence
similarity is useful in predicting subcellular localization. For
example, Nair and Rost (Protein Sci 2002;11:2836-2847) have carried
out extensive analysis of the relation between sequence similarity
and identity in subcellular localization, and have found a close
relationship between them above a certain similarity threshold. However,
many existing benchmark data sets used for the prediction accuracy
assessment contain highly homologous sequences, some data sets comprising
sequences up to 80-90\% sequence identity. Using these benchmark
test data will surely lead to overestimation of the performance of
the methods considered. Here, we develop an approach based on a two-level
support vector machine (SVM) system: the first level comprises a
number of SVM classifiers, each based on a specific type of feature
vectors derived from sequences; the second level SVM classifier functions
as the jury machine to generate the probability distribution of decisions
for possible localizations. We compare our approach with a global
sequence alignment approach and other existing approaches for two
benchmark data sets, one comprising prokaryotic sequences and the
other eukaryotic sequences. Furthermore, we carried out all-against-all
sequence alignment for several data sets to investigate the relationship
between sequence homology and subcellular localization. Our results,
which are consistent with previous studies, indicate that the homology
search approach performs well down to 30\% sequence identity, although
its performance deteriorates considerably for sequences sharing lower
sequence identity. A data set of high homology levels will undoubtedly
lead to biased assessment of the performances of the predictive approaches, especially
those relying on homology search or sequence annotations. Our two-level
classification system based on SVM does not rely on homology search;
therefore, its performance remains relatively unaffected by sequence
homology. When compared with other approaches, our approach performed
significantly better. Furthermore, we also develop a practical hybrid
method, which combines the two-level SVM classifier and the homology
search method, as a general tool for the sequence annotation of subcellular
localization.},
timestamp = {2007.05.18},
owner = {Marco},
keywords = {SubCellLoc}
}