Scalable advanced sampling in molecular dynamics using standalone tools for data mining (CHARMMing PIGS)
PI: Amedeo Caflisch (University of Zurich)
Co-PIs: Andreas Vitalis (University of Zurich), Claudio Gheller (CSCS / ETH Zurich), Michael Crowley (National Renewable Energy Laboratory, USA)
April 1, 2015 - March 31, 2017
Project Summary
Classical molecular dynamics (MD) simulations have proven to be a tool of outstanding value in biophysical studies of biomacromolecules [1]. Due to the fully atomistic resolution offered by this technique, MD simulations are able to both predict and complement experimental data with mechanistic and structural insight that would be impossible to obtain by experiment alone [2]. In recent years, a focus on software and hardware engineering and on computing efficiency has thoroughly pervaded the development of software packages implementing MD methodologies, which had previously tended to be heterogeneous and poorly optimized. This change in paradigm, from a development model focused on scientific algorithms and broad functionality to one focused on performance optimization, was driven primarily by the desire to simulate longer time scales [3]. In reality, general purpose high-performance computing (HPC) hardware offers limited scalability for MD simulations because the computation per step is lightweight, and this is exemplified by the development and vastly superior performance of special purpose machines such as Anton [4]. Here, we propose to overcome the scalability issue by establishing a working implementation of a class of advanced sampling methods that propagate the dynamics of many copies of a system in parallel. Sampling is enhanced by selectively restarting individual copies from selected time points of other copies, a process we refer to as reseeding. Techniques falling into this class are in widespread use by the MD community and include the replica exchange method [5, 6] and the original distributed computing approach [7]. Our specific aims center around a related method, progress index guided sampling (PIGS) [8], which we developed recently. The goals of the project are as follows:
- Specific Aim I: Develop and test parallel implementations of two required data mining techniques we developed recently: a tree-based clustering algorithm [9] and an algorithm to arrange and annotate time series data to reveal metastable states [10, 11].
- Specific Aim II: Establish a suitable development platform for a fully scalable implementation of the PIGS protocol on general purpose HPC resources and extend and/or modify it to incorporate the parallel versions of the algorithms developed in Specific Aim I. Test the resultant, scalable PIGS scheme.
- Specific Aim III: Evaluate and refine the PIGS reseeding heuristic through performance tests on diverse applications. The list of applications should include simulations of biomolecular self-assembly, protein simulations with the goal of diversifying targets for virtual screening campaigns, and, time permitting, simulations of the formation of ordered, heterotypic interfaces.
With Specific Aim I, we will accomplish two worthwhile goals. First, the scientific community in general and the users of CSCS facilities in particular will be provided with access to scalable, general purpose data mining algorithms that overcome both time and space limitations through parallelization. The clustering algorithm [9] can be applied to any data set for which a notion of distance between samples can be defined. It is useful for very large data sets of high dimensionality as it scales linearly with the number of samples and, by virtue of a tree-based data structure, is able to detect and adapt to local sample density exceptionally well. Big data are routinely produced in many supercomputing applications, e.g., in climate research, in astrophysics, or in atomistic simulations. These applications usually yield time series data. For data of this type, we have developed an algorithm that arranges the samples in an informative sequence, the so-called progress index [10]. This algorithm also scales linearly with the number of samples, and the resulting sequence, i.e., the progress index, can be annotated to reveal metastable states [11]. Both algorithms will be provided in a standalone library available to anyone.
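To make the idea of arranging samples into a progress index concrete, the following is a minimal Python sketch, not the algorithm of [10]. It assumes that the sequence is grown by repeatedly appending the unplaced sample that is closest to any sample already placed, and it uses Euclidean distance between coordinate tuples. The function name progress_index and its parameters are hypothetical. The sketch costs quadratic time in the number of samples, so it illustrates only the ordering principle and not the linear scaling stated above.

```python
import math


def progress_index(points, start=0):
    """Return an ordering of sample indices (illustrative, O(N^2)).

    Assumption: each step appends the unplaced sample with the smallest
    distance to any sample already in the sequence.
    """
    n = len(points)
    # best[i] holds the distance from sample i to its nearest placed sample.
    best = [math.dist(points[start], p) for p in points]
    placed = [False] * n
    placed[start] = True
    order = [start]
    for _ in range(n - 1):
        # Pick the unplaced sample that is closest to the placed set.
        nxt = min((i for i in range(n) if not placed[i]), key=lambda i: best[i])
        placed[nxt] = True
        order.append(nxt)
        # Update nearest-distance bookkeeping for the remaining samples.
        for i in range(n):
            if not placed[i]:
                d = math.dist(points[nxt], points[i])
                if d < best[i]:
                    best[i] = d
    return order


# Two well-separated groups on a line: one near 0, one near 5.
samples = [(0.0,), (0.1,), (5.0,), (0.2,), (5.1,)]
order = progress_index(samples, start=0)
```

In this toy input, the samples near 0 are placed before the ordering crosses over to the samples near 5, which is the kind of grouping that an annotation of the sequence could then highlight as distinct states.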
The second goal we will accomplish with Specific Aim I is to provide the necessary groundwork for Specific Aim II. Here, we will extend an existing molecular dynamics software package that has support for hybrid compute nodes to interact with the newly developed standalone library to provide a fully scalable implementation of the PIGS advanced sampling protocol. The development platform will be CHARMM due to the expertise of the teams carrying out most of the proposed work. CHARMM has recently been modernized [12] by a team led by one of the co-PIs of this proposal, and the latest versions offer competitive performance both in parallel and multicore CPU settings and on GPUs. The choice is justified not only on account of our collective expertise but also on account of CHARMM's large user base and its role as a highly influential tool in the computational biophysics community. The PIGS protocol aims to reduce sampling redundancy by evolving many copies of a system in parallel. To do so, it analyzes a composite data set with contributions from all copies over a fixed time window. The decision to reseed simulations in a nonrandom fashion is what provides the main benefit of the method, i.e., a much improved rate of phase space exploration. In conventional MD, exploration is hindered by the long lifetimes observed for metastable states in complex systems. Evaluation, optimization, and testing of the details of the heuristic used for reseeding are the content of Specific Aim III.
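The reseeding decision can be pictured with the following minimal Python sketch. It is a hypothetical stand-in and not the PIGS heuristic itself: it assumes that each copy has already been assigned a single score from the composite data set (called novelty here, higher meaning less redundant), that the n_keep best copies continue unchanged, and that every other copy is restarted from one of the retained copies in round-robin order. The names plan_reseeding, novelty, and n_keep are illustrative assumptions.

```python
def plan_reseeding(novelty, n_keep):
    """Return source, where source[i] is the copy that copy i continues from.

    Hypothetical rule: the n_keep highest-scoring copies continue from
    themselves; all other copies are reseeded from the retained copies.
    """
    n = len(novelty)
    if not 0 < n_keep <= n:
        raise ValueError("n_keep must be between 1 and the number of copies")
    # Rank copies from most to least novel (ties keep their original order).
    ranked = sorted(range(n), key=lambda i: novelty[i], reverse=True)
    keepers = ranked[:n_keep]
    # By default every copy continues from its own final state.
    source = list(range(n))
    # Redundant copies restart from retained copies, assigned round-robin.
    for k, i in enumerate(ranked[n_keep:]):
        source[i] = keepers[k % n_keep]
    return source


# Four copies; copies 1 and 3 are the least redundant and are retained.
plan = plan_reseeding([0.1, 0.9, 0.3, 0.7], n_keep=2)
```

In this example, copies 1 and 3 continue from their own states, copy 2 is restarted from copy 1, and copy 0 is restarted from copy 3. A real implementation would replace the placeholder score with the analysis of the composite data set and would need to exchange coordinates between the processes that run the copies.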
Many processes of interest, for example protein aggregation, exhibit a large number of states and pathways that are often under kinetic and not thermodynamic control [13]. For these processes, equilibrium MD simulations are almost impossible to use, especially given that structural details of many relevant states have thus far not been elucidated. Importantly, the parallel communication between copies in PIGS yields a synergistic algorithm that is explicitly designed to exploit the width of HPC resources. This means that a single calculation using N copies of a complex system will yield a faster rate of exploration of phase space than, for example, two independent calculations each using N/2 copies. Therefore, we anticipate that increasing the width (number of copies) of a PIGS run provides additional gains not accessible otherwise. For computational efficiency to be maintained, the PIGS protocol must be implemented in a fully scalable fashion, and this is the primary goal we pursue in the scope of this proposal.