Article

Clustering molecular dynamics trajectories: 1. Characterizing the performance of different clustering algorithms

J. Shao, S. Tanner, N. Thompson, and T. Cheatham.
JOURNAL OF CHEMICAL THEORY AND COMPUTATION, 3 (6): 2312-2334 (November 2007)
DOI: 10.1021/ct700119m

Abstract

Molecular dynamics simulation methods produce trajectories of atomic positions (and optionally velocities and energies) as a function of time and provide a representation of the sampling of a given molecule's energetically accessible conformational ensemble. As simulations on the 10-100 ns time scale become routine, with sampled configurations stored on the picosecond time scale, such trajectories contain large amounts of data. Data-mining techniques, like clustering, provide one means to group and make sense of the information in the trajectory. In this work, several clustering algorithms were implemented, compared, and utilized to understand MD trajectory data. The development of the algorithms into a freely available C code library, and their application to a simple test example of random (or systematically placed) points in a 2D plane (where the pairwise metric is the distance between points) provide a means to understand the relative performance. Eleven different clustering algorithms were developed, ranging from top-down splitting (hierarchical) and bottom-up aggregating (including single-linkage edge joining, centroid-linkage, average-linkage, complete-linkage, centripetal, and centripetal-complete) to various refinement (means, Bayesian, and self-organizing maps) and tree (COBWEB) algorithms. Systematic testing in the context of MD simulation of various DNA systems (including DNA single strands and the interaction of a minor groove binding drug DB226 with a DNA hairpin) allows a more direct assessment of the relative merits of the distinct clustering algorithms. Additionally, means to assess the relative performance and differences between the algorithms, to dynamically select the initial cluster count, and to achieve faster data mining by "sieved clustering" were evaluated.
Overall, it was found that there is no one perfect "one size fits all" algorithm for clustering MD trajectories and that the results strongly depend on the choice of atoms for the pairwise comparison. Some algorithms tend to produce homogeneously sized clusters, whereas others have a tendency to produce singleton clusters. Issues related to the choice of pairwise metric, the clustering metrics, the atom selection used for the comparison, and the relative performance are discussed. Overall, the best performance was observed with the average-linkage, means, and SOM algorithms. If the cluster count is not known in advance, the hierarchical or average-linkage clustering algorithms are recommended. Although these algorithms perform well, it is important to be aware of the limitations or weaknesses of each algorithm, specifically the high sensitivity to outliers with hierarchical, the tendency to generate homogeneously sized clusters with means, and the tendency to produce small or singleton clusters with average-linkage.
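
The 2D test case mentioned in the abstract (points in a plane, with the distance between points as the pairwise metric) can be illustrated with a small sketch of bottom-up average-linkage clustering. This is not the authors' C library: the function name `average_linkage`, the sample points, and the target cluster count are all assumptions chosen for illustration, and the naive search over cluster pairs is only practical for small inputs.

```python
import math
from itertools import combinations


def average_linkage(points, n_clusters):
    """Hypothetical helper: merge the two clusters with the smallest
    mean point-to-point distance until n_clusters remain."""
    # Start with every point in its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average of all pairwise Euclidean distances between clusters a and b.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest average-linkage distance.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # x < y always holds, so deleting y does not shift x.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Illustrative data: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0)]
result = average_linkage(points, 3)
print(result)
```

With these made-up points the isolated point ends up alone in its own cluster, which gives a feel for the tendency toward small or singleton clusters that the abstract attributes to average-linkage.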

BibTeX key: ISI:000251024200037
entry type: article
address: 1155 16TH ST, NW, WASHINGTON, DC 20036 USA
year: 2007
month: NOV-DEC
journal: JOURNAL OF CHEMICAL THEORY AND COMPUTATION
number: 6
pages: 2312-2334
publisher: AMER CHEMICAL SOC
volume: 3
type: Review
issn: 1549-9618
doc-delivery-number: 232MQ
affiliation: Cheatham, TE (Reprint Author), Univ Utah, Coll Pharm, Dept Med Chem, 2000 E 30 S,Skaggs Hall 201, Salt Lake City, UT 84112 USA. Univ Utah, Coll Pharm, Dept Med Chem, Salt Lake City, UT 84112 USA. Univ Utah, Coll Pharm, Dept Pharmaceut & Pharmaceut Chem, Salt Lake City, UT 84112 USA. Univ Utah, Coll Pharm, Dept Bioengn, Salt Lake City, UT 84112 USA.
journal-iso: J. Chem. Theory Comput.
author-email: tec3@utah.edu
keywords-plus: PROTEIN CONFORMATIONAL SPACE; INTEGRASE CATALYTIC CORE; EXPLICIT SOLVENT; HIV-1 PROTEASE; NUCLEIC-ACIDS; MINOR-GROOVE; EQUILIBRIUM SIMULATIONS; COMPUTER-SIMULATION; FOLDING SIMULATIONS; CONTINUUM SOLVENT
subject-category: Chemistry, Multidisciplinary
number-of-cited-references: 116
language: English
unique-id: ISI:000251024200037
DOI: 10.1021/ct700119m
times-cited: 0

Cite this publication

%0 Journal Article
%1 ISI:000251024200037
%A Shao, Jianyin
%A Tanner, Stephen W.
%A Thompson, Nephi
%A Cheatham, Thomas E.
%C 1155 16TH ST, NW, WASHINGTON, DC 20036 USA
%D 2007
%I AMER CHEMICAL SOC
%J JOURNAL OF CHEMICAL THEORY AND COMPUTATION
%K 2D clustering
%N 6
%P 2312-2334
%R 10.1021/ct700119m
%T Clustering molecular dynamics trajectories: 1. Characterizing the performance of different clustering algorithms
%V 3
