Multi-level parallelization and hybrid acceleration of simulation codes

While on my way back from a stimulating research meeting in Glasgow, I read an article, "The world’s fastest software for Molecular Dynamics on CPUs & GPUs", on the Swedish e-Science Research Centre website.

The article described the parallelization strategy of GROMACS and highlighted the challenges posed by hardware that is becoming ever more heterogeneous. This is true: processor chips with multiple cores are now a common feature, and some GPUs offer hundreds of cores. Simulation codes are trying to keep pace with this multi-core development; for example, "Molecular Dynamics Simulation of Multi-Scale Flows on GPUs" describes hybrid acceleration of MPI-based OpenFOAM to harness the computational power of GPUs.

It is expected that the number of cores per chip will grow much faster than processor clock speeds. In such a scenario, MPI, the de-facto standard for large-scale scientific computation, will face stiff competition from hybrid parallelization based on process-level MPI and thread-level OpenMP. Most simulation codes still rely on the MPI library for their parallelization. For example, in OpenFOAM the MPI library is plugged in through the "Pstream" interface, which wraps all parallel communication, while the "decomposePar" utility implements the domain decomposition. An interesting article, "Evaluation of Multi-threaded OpenFOAM Hybridization for Massively Parallel Architectures", reports a case where MPI communication becomes a real bottleneck to scalability and suggests a hybrid multi-threaded approach as a possible solution.
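To make the hybrid idea concrete, below is a minimal sketch of my own (a toy example, not OpenFOAM or GROMACS code) of the MPI + OpenMP pattern: each MPI rank owns one block of a decomposed domain and uses OpenMP threads for the loop over its local cells, so that only the data exchange between nodes goes through MPI.

    // hybrid_sketch.cpp -- toy MPI + OpenMP hybrid example (not OpenFOAM code).
    // Each MPI rank owns a block of a decomposed domain and updates it with
    // an OpenMP-parallel loop; only the final reduction crosses node boundaries.
    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv)
    {
        // Ask for a threading level where only the master thread calls MPI.
        int provided = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank = 0, nranks = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const int localSize = 1000000;            // cells owned by this rank
        std::vector<double> field(localSize, 1.0);
        double localSum = 0.0;

        // Shared-memory level: OpenMP threads split the loop over local cells.
        #pragma omp parallel for reduction(+:localSum)
        for (int i = 0; i < localSize; ++i)
        {
            field[i] *= 2.0;                      // stand-in for real per-cell work
            localSum += field[i];
        }

        // Distributed-memory level: one collective call per rank, issued by the
        // master thread only (consistent with MPI_THREAD_FUNNELED).
        double globalSum = 0.0;
        MPI_Reduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
        {
            std::printf("ranks: %d, threads per rank: %d, global sum: %g\n",
                        nranks, omp_get_max_threads(), globalSum);
        }

        MPI_Finalize();
        return 0;
    }

Such a program would typically be launched with mpirun (or mpiexec) using one rank per node or per socket, with the OMP_NUM_THREADS environment variable set to the number of cores available to each rank.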

The multicore evolution has left the memory and cache subsystem lagging further and further behind, and with non-uniform memory access (NUMA) effects this can indeed become a performance bottleneck. A multi-level parallelization strategy addresses these NUMA and communication-related issues by pairing MPI (distributed memory) between nodes with threads (shared memory) for efficient intra-node parallelism. GROMACS, for example, implements multi-level parallelization as well as hybrid acceleration by combining:

  1. SIMD (single-instruction multiple-data) parallelization at the instruction level,
  2. OpenMP multithreading between cores within a node (see the sketch further below),
  3. MPI between nodes,
  4. hybrid acceleration, in which GPUs carry out the compute-intensive non-bonded force calculation while the rest runs on the CPUs.

To know more, refer to Pronk S et al. (2013), "GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit", Bioinformatics, Vol. 29, pp. 845-854.
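As a small illustration of the two innermost levels in that list (leaving out the MPI and GPU layers), here is a toy sketch of mine, not taken from GROMACS, that combines OpenMP multithreading over the cores of a node with compiler-driven SIMD vectorization inside each thread:

    // simd_threads_sketch.cpp -- toy illustration of instruction-level SIMD
    // plus OpenMP threading within a node (levels 1 and 2 above). Not GROMACS code.
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main()
    {
        const int n = 4096;
        std::vector<float> x(n), f(n, 0.0f);
        for (int i = 0; i < n; ++i) x[i] = 0.01f * i;

        // Level 2: distribute the particles over the cores of one node.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
        {
            float fi = 0.0f;

            // Level 1: vectorize the inner interaction loop, applying a single
            // instruction to multiple j-particles at once.
            #pragma omp simd reduction(+:fi)
            for (int j = 0; j < n; ++j)
            {
                float r = x[i] - x[j];
                fi += r / (1.0f + r * r);   // stand-in for a real pair force
            }
            f[i] = fi;
        }

        std::printf("threads available: %d, f[0] = %g\n",
                    omp_get_max_threads(), f[0]);
        return 0;
    }

The outer pragma spreads particles over threads, while the omp simd pragma asks the compiler to process several j-particles per instruction, which is essentially what hand-tuned SIMD kernels do far more aggressively.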

GROMACS takes advantage of both the MPI and OpenMP programming models. That is why it has been identified by the PRACE pan-European HPC initiative and the CRESTA exascale collaborative project as a key code for the effective exploitation of both current and future HPC systems (http://www.hector.ac.uk/cse/distributedcse/reports/GROMACS/).