Abstract
Molecular dynamics simulation methods produce trajectories of atomic
positions (and optionally velocities and energies) as a function of
time and provide a representation of the sampling of a given molecule's
energetically accessible conformational ensemble. As simulations on the
10-100 ns time scale become routine, with sampled configurations stored
on the picosecond time scale, such trajectories contain large amounts
of data. Data-mining techniques, like clustering, provide one means to
group and make sense of the information in the trajectory. In this
work, several clustering algorithms were implemented, compared, and
utilized to understand MD trajectory data. The development of the
algorithms into a freely available C code library, and their
application to a simple test example of random (or systematically
placed) points in a 2D plane (where the pairwise metric is the distance
between points) provide a means to understand the relative performance.
Eleven different clustering algorithms were developed, ranging from
top-down splitting (hierarchical) and bottom-up aggregating (including
single-linkage edge joining, centroid-linkage, average-linkage,
complete-linkage, centripetal, and centripetal-complete) to various
refinement (means, Bayesian, and self-organizing maps) and tree
(COBWEB) algorithms. Systematic testing in the context of MD simulation
of various DNA systems (including DNA single strands and the
interaction of a minor groove binding drug DB226 with a DNA hairpin)
allows a more direct assessment of the relative merits of the distinct
clustering algorithms. Additionally, means to assess the relative
performance and differences between the algorithms, to dynamically
select the initial cluster count, and to achieve faster data mining by
``sieved clustering'' were evaluated. Overall, it was found that
there is no one perfect ``one size fits all'' algorithm for
clustering MID trajectories and that the results strongly depend on the
choice of atoms for the pairwise comparison. Some algorithms tend to
produce homogeneously sized clusters, whereas others have a tendency to
produce singleton clusters. Issues related to the choice of a pairwise
metric, clustering metrics, which atom selection is used for the
comparison, and about the relative performance are discussed. Overall,
the best performance was observed with the average-linkage, means, and
SOM algorithms. If the cluster count is not known in advance, the
hierarchical or average-linkage clustering algorithms are recommended.
Although these algorithms perform well, it is important to be aware of
the limitations or weaknesses of each algorithm, specifically the high
sensitivity to outliers with hierarchical, the tendency to generate
homogenously sized clusters with means, and the tendency to produce
small or singleton clusters with average-linkage.
Users
Please
log in to take part in the discussion (add own reviews or comments).