Abstract
We present an algorithm that extracts
discriminant rules from oligopeptides for protease
proteolytic cleavage activity prediction. The algorithm
is based on genetic programming. Three
important components in the algorithm are a min-max
scoring function, the reverse Polish notation (RPN) and
the use of minimum description length. The min-max
scoring function is developed using amino acid
similarity matrices for measuring the similarity
between an oligopeptide and a rule, which is a complex
algebraic equation of amino acids rather than a simple
pattern sequence. The Fisher ratio is then calculated
on the scoring values using the class label associated
with the oligopeptides. The discriminant ability of
each rule can therefore be evaluated. The use of RPN
makes the evolutionary operations simpler and therefore
reduces the computational cost. To prevent overfitting,
the concept of minimum description length is used to
penalize over-complicated rules. The fitness function
therefore combines the Fisher ratio with a minimum
description length penalty for an efficient
evolutionary process. Applied to four
protease datasets (Trypsin, Factor Xa, Hepatitis C
Virus and HIV protease cleavage site prediction), our
algorithm is superior to C5, a conventional method for
deriving decision trees.
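
The interplay of the three components can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the toy similarity function (standing in for a real amino acid similarity matrix), the rule encoding as (position, residue) operands combined by "min" and "max" operators, the form of the Fisher ratio, and the length-based penalty with its hypothetical `penalty` weight.

```python
from statistics import mean, pvariance

# Toy residue groups: an assumed stand-in for a real similarity matrix.
GROUPS = [set("KRH"), set("DE"), set("AVLIM"), set("FWY"), set("STNQ")]

def similarity(a, b):
    """Return 1.0 for identical residues, 0.5 within a toy group, else 0.0."""
    if a == b:
        return 1.0
    if any(a in g and b in g for g in GROUPS):
        return 0.5
    return 0.0

def score(rule, peptide):
    """Evaluate an RPN rule against a peptide using a stack.

    Operands are (position, residue) pairs; 'min' and 'max' are binary
    operators applied to the two most recent values on the stack.
    """
    stack = []
    for token in rule:
        if token in ("min", "max"):
            right = stack.pop()
            left = stack.pop()
            stack.append(min(left, right) if token == "min" else max(left, right))
        else:
            pos, residue = token
            stack.append(similarity(peptide[pos], residue))
    if len(stack) != 1:
        raise ValueError("malformed RPN rule")
    return stack[0]

def fisher_ratio(scores, labels):
    """Squared difference of class means over the sum of class variances."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    spread = pvariance(pos) + pvariance(neg)
    return (mean(pos) - mean(neg)) ** 2 / (spread + 1e-9)  # avoid divide-by-zero

def fitness(rule, peptides, labels, penalty=0.01):
    """Fisher ratio minus a description-length-style penalty on rule size."""
    scores = [score(rule, p) for p in peptides]
    return fisher_ratio(scores, labels) - penalty * len(rule)

# Hypothetical toy data: first residue K or R marks the positive class.
peptides = ["KA", "RG", "AG", "DL"]
labels = [1, 1, 0, 0]
good_rule = [(0, "K"), (0, "R"), "max"]   # "position 0 resembles K or R"
bad_rule = [(0, "D")]                     # "position 0 resembles D"
```

Because the rule is a flat token list, crossover and mutation can operate on list slices rather than on tree structures, which is one plausible reading of why RPN simplifies the evolutionary operations. On the toy data, `fitness(good_rule, peptides, labels)` exceeds `fitness(bad_rule, peptides, labels)`, since the first rule separates the two classes and the second does not.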