Sequence analysis of large protein families can reveal sub-clusters within a single family. In some cases, it is of interest to know precisely which amino acid position variations are most responsible for driving the separation into sub-clusters. In large families composed of large proteins, assigning relative importance to specific amino acid positions can be quite challenging. Principal components analysis (PCA) is ideal for such a task: the problem is posed in a high-dimensional variable space, with dimensionality set by the number of amino acid positions in the protein sequence, and PCA is powerful at reducing the dimensionality of complex problems by projecting the data into an eigenspace that represents the directions of greatest variation. However, PCA of aligned protein sequence families is complicated by the fact that protein sequences are traditionally represented by single-letter alphabetic codes, whereas PCA requires that the sequence information first be converted into a numerical representation. Here, we introduce a new amino acid sequence conversion algorithm optimized for PCA data input. The method is demonstrated on three datasets: a small artificial dataset that illustrates the characteristics and performance of the algorithm; a small protein sequence family of nine members, COG2263; and a large protein sequence family, Pfam04237, which contains more than 1,800 sequences that group into two sub-clusters.
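The general workflow described above (convert aligned sequences to numbers, run PCA, then ask which positions drive the separation) can be sketched in a few lines. This is only an illustration under stated assumptions: it uses plain one-hot encoding as a generic stand-in, not the conversion algorithm introduced in this work, and the toy alignment and the function names `one_hot_encode`, `pca`, and `position_importance` are hypothetical.

```python
import numpy as np

# Assumption: 20 standard amino acids plus a gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot_encode(alignment):
    """Encode aligned sequences as an (n_seqs, n_positions * 21) 0/1 matrix.

    One-hot encoding is a generic stand-in here, not the paper's algorithm.
    """
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    n_seqs, n_pos = len(alignment), len(alignment[0])
    X = np.zeros((n_seqs, n_pos * len(ALPHABET)))
    for s, seq in enumerate(alignment):
        for p, aa in enumerate(seq):
            X[s, p * len(ALPHABET) + index[aa]] = 1.0
    return X


def pca(X, n_components=2):
    """PCA via SVD of the column-centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # sequence coordinates
    loadings = Vt[:n_components]                       # unit-length directions
    explained = (S ** 2 / np.sum(S ** 2))[:n_components]
    return scores, loadings, explained


def position_importance(loading, n_pos):
    """Sum squared loadings over the 21 symbols at each alignment position."""
    return (loading.reshape(n_pos, len(ALPHABET)) ** 2).sum(axis=1)


# Hypothetical toy alignment: two sub-clusters that differ at position 2
# (0-based: A vs D), plus unrelated variation at position 4 (V vs I).
alignment = ["MKALV", "MKALV", "MKALI", "MKDLV", "MKDLV", "MKDLI"]

X = one_hot_encode(alignment)
scores, loadings, explained = pca(X)
importance = position_importance(loadings[0], n_pos=len(alignment[0]))

print("PC1 scores:", np.round(scores[:, 0], 3))
print("Position driving PC1:", int(np.argmax(importance)))
```

In this toy case the sign of the first principal component score separates the two sub-clusters, and the per-position sum of squared loadings points to the column that distinguishes them. A real alignment would also need decisions about gaps, ambiguous residues, and column weighting that this sketch does not address.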
Principal components analysis of protein sequence clusters