Background: Methods for predicting protein function
directly from amino acid sequences are useful tools in
the study of uncharacterised protein families and in
comparative genomics. Until now, this problem has been
approached using machine learning techniques that
attempt to predict membership, or otherwise, to
predefined functional categories or subcellular
locations. A potential drawback of this approach is
that the human-designated functional classes may not
accurately reflect the underlying biology, and
consequently important sequence-to-function
relationships may be missed.

Results: We show that a
self-supervised data mining approach is able to find
relationships between sequence features and functional
annotations. No preconceived ideas about functional
categories are required, and the training data is
simply a set of protein sequences and their
UniProt/Swiss-Prot annotations. The main technical
aspect of the approach is the co-evolution of amino
acid-based regular expressions and keyword-based
logical expressions with genetic programming. Our
experiments on a strictly non-redundant set of
eukaryotic proteins reveal that the strongest and most
easily detected sequence-to-function relationships are
concerned with targeting to various cellular
compartments, which is an area already well studied
both experimentally and computationally. Of more
interest are a number of broad functional roles which
can also be correlated with sequence features. These
include inhibition, biosynthesis, transcription and
defence against bacteria. Despite substantial overlaps
between these functions and their corresponding
cellular compartments, we find clear differences in the
sequence motifs used to predict some of these
functions. For example, the presence of polyglutamine
repeats appears to be linked more strongly to the
"transcription" function than to the general
"nuclear" function/location.

Conclusion: We have
developed a novel and useful approach for knowledge
discovery in annotated sequence data. The technique is
able to identify functionally important sequence
features and does not require expert knowledge. By
viewing protein function from a sequence perspective,
the approach is also suitable for discovering
unexpected links between biological processes, such as
the recently discovered role of ubiquitination in
transcription.
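
The pairing of a sequence-based regular expression with a keyword-based logical expression, as described in the abstract, can be illustrated with a small sketch. Everything below is an assumption made for illustration: the toy sequences and keywords are invented (they are not data or results from the paper), the names `sequence_rule`, `has_keyword` and `agreement` are hypothetical, and the Matthews correlation coefficient stands in for whatever fitness measure the authors actually used. The sketch scores one fixed pair of expressions only; it does not implement the genetic programming search that co-evolves them.

```python
import re
from math import sqrt

# Invented toy data: (amino acid sequence, set of annotation keywords).
# These are NOT taken from the paper or from UniProt/Swiss-Prot.
PROTEINS = [
    ("MSTQQQQQQQQAPLK", {"transcription", "nuclear"}),
    ("MKRLLAAGQQQQQQQQQH", {"transcription", "nuclear"}),
    ("MAPKKKRKVEDSAL", {"nuclear"}),
    ("MLSLRQSIRFFKPATRTL", {"mitochondrion"}),
    ("MKWVTFISLLLLFSSAYS", {"secreted", "inhibition"}),
]

# Sequence side: a regular expression over amino acid letters.
# Here, a run of six or more glutamines (a polyglutamine repeat).
POLY_Q = re.compile(r"Q{6,}")


def sequence_rule(seq):
    """Return True if the sequence contains a polyglutamine repeat."""
    return POLY_Q.search(seq) is not None


def has_keyword(word):
    """Build a (very simple) keyword-based logical expression."""
    return lambda keywords: word in keywords


def agreement(proteins, seq_rule, kw_rule):
    """Matthews correlation between the two boolean rules (assumed measure)."""
    tp = fp = fn = tn = 0
    for seq, keywords in proteins:
        s, k = seq_rule(seq), kw_rule(keywords)
        if s and k:
            tp += 1
        elif s and not k:
            fp += 1
        elif k and not s:
            fn += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    for word in ("transcription", "nuclear"):
        score = agreement(PROTEINS, sequence_rule, has_keyword(word))
        print(f"polyQ vs {word!r}: {score:.3f}")
```

In a full system of the kind the abstract describes, a score like this would presumably serve as the fitness signal for evolving both expressions together, so that no functional category has to be fixed in advance.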
%0 Journal Article
%1 oai:biomedcentral.com:1471-2105-7-16
%A Brameier, Markus
%A Haan, Josien
%A Krings, Andrea
%A MacCallum, Robert M
%D 2006
%I BioMed Central Ltd.
%J BMC Bioinformatics
%K algorithms, genetic programming
%N 16
%R 10.1186/1471-2105-7-16
%T Automatic discovery of cross-family sequence features associated with protein function
%U http://www.biomedcentral.com/1471-2105/7/16
%V 7
@article{oai:biomedcentral.com:1471-2105-7-16,
added-at = {2008-06-19T17:35:00.000+0200},
author = {Brameier, Markus and Haan, Josien and Krings, Andrea and MacCallum, Robert M},
bibsource = {OAI-PMH server at www.biomedcentral.com},
biburl = {https://www.bibsonomy.org/bibtex/2dce235c02e5f81ec75d14988f43df44c/brazovayeye},
doi = {10.1186/1471-2105-7-16},
interhash = {f43b14221f6198fe1245a86fccdd734b},
intrahash = {dce235c02e5f81ec75d14988f43df44c},
issn = {1471-2105},
journal = {BMC Bioinformatics},
keywords = {algorithms, genetic programming},
language = {en},
month = jan,
day = 12,
notes = {PMID: 16409628},
number = 16,
oai = {oai:biomedcentral.com:1471-2105-7-16},
publisher = {BioMed Central Ltd.},
rights = {Copyright 2006 Brameier et al; licensee BioMed Central
Ltd.},
size = {16 pages},
timestamp = {2008-06-19T17:36:55.000+0200},
title = {Automatic discovery of cross-family sequence features
associated with protein function},
url = {http://www.biomedcentral.com/1471-2105/7/16},
volume = 7,
year = 2006
}