Principal components analysis of protein sequence clusters

B. Wang, and M. Kennedy. J Struct Funct Genomics, 15 (1): 1-11 (March 2014)
DOI: 10.1007/s10969-014-9173-2

Abstract

Sequence analysis of large protein families can produce sub-clusters even within the same family. In some cases, it is of interest to know precisely which amino acid position variations are most responsible for driving separation into sub-clusters. In large protein families composed of large proteins, it can be quite challenging to assign the relative importance to specific amino acid positions. Principal components analysis (PCA) is ideal for such a task, since the problem is posed in a large variable space, i.e. the number of amino acids that make up the protein sequence, and PCA is powerful at reducing the dimensionality of complex problems by projecting the data into an eigenspace that represents the directions of greatest variation. However, PCA of aligned protein sequence families is complicated by the fact that protein sequences are traditionally represented by single letter alphabetic codes, whereas PCA of protein sequence families requires conversion of sequence information into a numerical representation. Here, we introduce a new amino acid sequence conversion algorithm optimized for PCA data input. The method is demonstrated using a small artificial dataset to illustrate the characteristics and performance of the algorithm, as well as a small protein sequence family consisting of nine members, COG2263, and finally with a large protein sequence family, Pfam04237, which contains more than 1,800 sequences that group into two sub-clusters.
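The abstract does not spell out the authors' conversion algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: it swaps in a plain one-hot (indicator) encoding of each alignment column and runs PCA through a singular value decomposition in NumPy. The function names (`one_hot_encode`, `pca`, `position_importance`) and the toy alignment are hypothetical and are not taken from the paper.

```python
import numpy as np

# Assumed alphabet: the 20 standard amino acids plus a gap symbol.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot_encode(alignment):
    """Convert equal-length aligned sequences to an (n_seqs, length * 21) 0/1 matrix.

    NOTE: one-hot encoding is a stand-in, not the paper's conversion algorithm.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    length = len(alignment[0])
    X = np.zeros((len(alignment), length * len(AMINO_ACIDS)))
    for row, seq in enumerate(alignment):
        if len(seq) != length:
            raise ValueError("all aligned sequences must have the same length")
        for pos, aa in enumerate(seq):
            X[row, pos * len(AMINO_ACIDS) + index[aa]] = 1.0
    return X


def pca(X, n_components=2):
    """PCA via SVD of the column-centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # sequences projected onto the PCs
    loadings = Vt[:n_components]                     # PC directions in encoded space
    explained = (S ** 2) / np.sum(S ** 2)            # fraction of variance per PC
    return scores, loadings, explained[:n_components]


def position_importance(loading_row, length):
    """Sum squared loadings over the 21 symbols at each alignment position."""
    return (loading_row.reshape(length, len(AMINO_ACIDS)) ** 2).sum(axis=1)


# Made-up toy alignment: two sub-clusters that differ at position 2 (0-based),
# plus minor variation at position 4.
toy_alignment = [
    "MKALV", "MKALV", "MKALV", "MKALI",
    "MKDLV", "MKDLV", "MKDLV", "MKDLI",
]

X = one_hot_encode(toy_alignment)
scores, loadings, explained = pca(X)
importance = position_importance(loadings[0], len(toy_alignment[0]))
top_position = int(np.argmax(importance))  # position contributing most to PC1
```

In this toy case the first principal component separates the A-containing sequences from the D-containing ones, and the per-position sum of squared loadings points to the alignment column responsible. A real family such as Pfam04237 would need an actual multiple sequence alignment as input, and the choice of numerical encoding would affect the result.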

Description

Principal components analysis of protein sequence clusters

Links and resources

BibTeX key: Wang:2014:J-Struct-Funct-Genomics:24496727
entry type: article
year: 2014
month: mar
journal: J Struct Funct Genomics
number: 1
pages: 1-11
volume: 15
pmid: 24496727
DOI: 10.1007/s10969-014-9173-2
url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3982804/

Cite this publication

@article{Wang:2014:J-Struct-Funct-Genomics:24496727,
  abstract = {Sequence analysis of large protein families can produce sub-clusters even within the same family. In some cases, it is of interest to know precisely which amino acid position variations are most responsible for driving separation into sub-clusters. In large protein families composed of large proteins, it can be quite challenging to assign the relative importance to specific amino acid positions. Principal components analysis (PCA) is ideal for such a task, since the problem is posed in a large variable space, i.e. the number of amino acids that make up the protein sequence, and PCA is powerful at reducing the dimensionality of complex problems by projecting the data into an eigenspace that represents the directions of greatest variation. However, PCA of aligned protein sequence families is complicated by the fact that protein sequences are traditionally represented by single letter alphabetic codes, whereas PCA of protein sequence families requires conversion of sequence information into a numerical representation. Here, we introduce a new amino acid sequence conversion algorithm optimized for PCA data input. The method is demonstrated using a small artificial dataset to illustrate the characteristics and performance of the algorithm, as well as a small protein sequence family consisting of nine members, COG2263, and finally with a large protein sequence family, Pfam04237, which contains more than 1,800 sequences that group into two sub-clusters.},
  added-at = {2017-07-17T14:45:47.000+0200},
  author = {Wang, B and Kennedy, M A},
  biburl = {https://www.bibsonomy.org/bibtex/25261a11a947bb97e647fd9b7668d1037/suqbar},
  description = {Principal components analysis of protein sequence clusters},
  doi = {10.1007/s10969-014-9173-2},
  interhash = {a6c34b3f59514bd9ffb86397c047d6fa},
  intrahash = {5261a11a947bb97e647fd9b7668d1037},
  journal = {J Struct Funct Genomics},
  keywords = {clustering pca protein protein_family protein_sequence protein_subfamily sequence_space},
  month = mar,
  number = 1,
  pages = {1-11},
  pmid = {24496727},
  timestamp = {2017-07-17T14:45:47.000+0200},
  title = {Principal components analysis of protein sequence clusters},
  url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3982804/},
  volume = 15,
  year = 2014
}

BibSonomy
