High Performance Computing – Petascale? Exascale?? Zettascale???

Being able to run large computational simulations plays an important role in the work of the Micro & Nano Flows group. Some of the methods we use, like Molecular Dynamics or Direct Simulation Monte Carlo, are incredibly useful but take a long time to compute even moderately sized problems. We tend to work on problems that are vast in molecular terms (albeit still at a tiny scale), such as the flow of a fluid through a nanotube or pore-scale evaporation from a hot surface. High Performance Computing (HPC) is therefore a great enabler of much of the work of the MNF group and a topic of particular importance and interest to me.

I wanted to briefly explore the current state of the HPC world, as there are some genuinely exciting but equally daunting changes going on at the moment that will change the face of what we know as HPC in the not too distant future, and codes like the ones we develop and use will need to adapt to remain relevant. So the big question is: what is changing, and how?

Perhaps the easiest and most obvious example is the current push towards so-called exascale computing. Simply put, this is the mission to have a single computing resource able to provide a peak processing performance of one exaFLOPS (10^18 floating point operations per second). Remarkably, it is looking likely that machines capable of this will be available in some countries within the next couple of years. This is truly remarkable when you consider that the most recent machines are capable of less than half that, with peak performance in the hundreds of petaFLOPS.
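To put the units in perspective, one exaFLOPS is 10^18 floating point operations per second, i.e. 1,000 petaFLOPS, and a rough theoretical peak for any machine is just the product of its node count, cores per node, clock speed and the FLOPs each core can retire per cycle. The little C++ sketch below works through that arithmetic; every hardware figure in it is an assumption picked purely for illustration and does not describe any real system.

```cpp
// peak_flops.cpp -- back-of-envelope theoretical peak, purely illustrative.
// 1 exaFLOPS = 1e18 floating point operations per second = 1,000 petaFLOPS.
#include <cstdio>

int main() {
    // All of these figures are assumptions for illustration, not a real machine.
    const double nodes           = 10000.0;  // assumed node count
    const double cores_per_node  = 128.0;    // assumed cores per node
    const double clock_hz        = 2.0e9;    // assumed clock speed (2 GHz)
    const double flops_per_cycle = 32.0;     // assumed vector FLOPs per core per cycle

    const double peak = nodes * cores_per_node * clock_hz * flops_per_cycle;

    std::printf("Theoretical peak: %.1f petaFLOPS = %.3f exaFLOPS\n",
                peak / 1e15, peak / 1e18);   // ~81.9 petaFLOPS for these numbers
    return 0;
}
```

Even with those fairly generous assumed figures, the result sits firmly in petascale territory, which hints at why exascale machines need such dense nodes and, as we will see, accelerators.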

So, what is changing to make this possible? Everything (sort of)!

Currently, most HPC users, especially in the UK, are used to large homogeneous systems typically based around Intel Xeon CPUs: large clusters of identical or similar nodes, typically with 24-36 cores per node and 64-128GB of RAM. Moving outside of the slightly conservative UK, there has been a push over the past few years to also include accelerator technology, such as GPUs (or perhaps the Intel Xeon Phi before its recent demise), which, when combined with the CPUs, makes for a system with a similar physical and power footprint but much higher peak performance.

In fact, until a few days ago, the fastest system in the world was Summit at Oak Ridge in the USA, a mixture of IBM Power9 CPUs and NVIDIA GPUs, and many of the larger systems in the official “TOP500” rankings follow the same pattern. It is this direction that gives us a hint of what’s to come and what we need to prepare for.

Going back to the UK’s HPC scene, the most obvious system for those doing the sort of work the MNF group undertakes is the UKRI-EPSRC owned service “ARCHER”, run by the Edinburgh Parallel Computing Centre. This is a very typical, homogeneous system. However, its replacement, “ARCHER2”, is quite different and points to the second major difference we will see in these new exascale systems.

While ARCHER2 is still (disappointingly) a homogeneous system, it is going to be a new Cray Shasta system and, more interestingly, it is going to be based on AMD CPUs rather than Intel. This is a significant shift and brings with it a very different computing environment. Where ARCHER has 24 Intel Xeon cores per node, ARCHER2 will have 128 AMD EPYC Zen2 cores. With 5,848 nodes, that means 748,544 physical CPU cores, compared to 109,056 physical cores in ARCHER. Quite the difference. A single ARCHER core can’t really be compared directly with a single ARCHER2 core, as they are completely different architectures, but it is probably reasonable to say they will be fairly even in terms of performance. Sounds amazing, right?
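The arithmetic behind those headline numbers is simple enough to check; here is a quick C++ sketch using only the figures quoted above:

```cpp
// core_counts.cpp -- reproduces the ARCHER / ARCHER2 core arithmetic quoted above.
#include <cstdio>

int main() {
    // ARCHER2 figures quoted in the text: 5,848 nodes of 128 AMD EPYC Zen2 cores.
    const long archer2_nodes          = 5848;
    const long archer2_cores_per_node = 128;

    // ARCHER figures quoted in the text: 109,056 cores at 24 Intel Xeon cores per node.
    const long archer_cores           = 109056;
    const long archer_cores_per_node  = 24;

    std::printf("ARCHER2: %ld nodes x %ld cores/node = %ld cores\n",
                archer2_nodes, archer2_cores_per_node,
                archer2_nodes * archer2_cores_per_node);          // 748,544 cores
    std::printf("ARCHER:  %ld cores / %ld cores/node = %ld nodes\n",
                archer_cores, archer_cores_per_node,
                archer_cores / archer_cores_per_node);            // 4,544 nodes
    return 0;
}
```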

Well… yes and no. Clearly, from a raw performance point of view, yes. However, think about how current codes scale when run in parallel. For example, we make use of the OpenFOAM suite of tools in our work, and current versions have been shown to scale to the low tens of thousands of cores, but only at a push. What is interesting is whether the same can be said when scaling across 2 physical processors providing 128 cores, compared to just 24. In theory things should be faster if everything is on the same motherboard, but there are many examples of current codes that simply don’t scale well on this kind of many-core CPU (e.g. the IBM Power series or Intel Xeon Phi) due to internal memory bandwidth issues. These problems are not insurmountable, but they need code development to rectify.
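One standard way to see why more cores does not simply mean more speed is Amdahl’s law: if a fraction p of a code’s runtime parallelises perfectly and the rest stays serial, the best possible speedup on n cores is 1 / ((1 − p) + p/n). The sketch below tabulates that for an optimistically assumed p of 0.99; note that it ignores memory bandwidth entirely, which is exactly the extra complication mentioned above.

```cpp
// amdahl.cpp -- strong-scaling speedup under Amdahl's law, purely illustrative.
#include <cstdio>

// Best-case speedup on n cores when a fraction p of the runtime parallelises perfectly.
double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main() {
    const double p = 0.99;  // assumed parallel fraction -- an optimistic guess
    for (double n : {24.0, 128.0, 1024.0, 16384.0}) {
        std::printf("%8.0f cores -> speedup %6.1fx (efficiency %5.1f%%)\n",
                    n, amdahl_speedup(p, n), 100.0 * amdahl_speedup(p, n) / n);
    }
    return 0;
}
```

Even with 99% of the runtime parallelised, efficiency drops below half somewhere around the 128-core mark in this toy model, and real memory and communication bottlenecks only make that worse.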

So, with all of this in mind, what is actually coming to HPC in the next 5 years as we move to exascale? Lots of (relatively) small CPU cores, decent amounts of memory, fast flash-based storage in layers that should remove much of the latency legacy parallel systems exhibit and, importantly, inherent and tightly integrated accelerator technology, most likely based around GPUs (both NVIDIA and AMD).

What is very clear though is that those developing codes for HPC systems can no longer just get things to scale to a few thousand cores and ignore accelerator offloading. This is not “the future” or “emerging technology” anymore. This is the new normal.
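For a flavour of what accelerator offloading actually looks like in code, here is a minimal sketch using OpenMP target directives, one of several possible routes (CUDA, HIP, SYCL and OpenACC being others). It is purely illustrative, not taken from any of the codes mentioned above, and it needs an offload-capable compiler (the exact flags vary by vendor); without an accelerator present the loop simply falls back to running on the host.

```cpp
// offload_axpy.cpp -- minimal, illustrative example of offloading a loop to an
// accelerator with OpenMP target directives (y = a*x + y over a large array).
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    const double a = 3.0;
    double* xp = x.data();
    double* yp = y.data();

    // Copy x and y to the device, run the loop there, copy y back to the host.
    #pragma omp target teams distribute parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }

    std::printf("y[0] = %.1f (expected 5.0)\n", yp[0]);
    return 0;
}
```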

If it needs to be hammered home, just look at the new No.1 in the world: the crown has gone to Japan with the ARM-based (yes, those quirky British-designed chips that power your phone) “Fugaku” system, hosted at the RIKEN Centre for Computational Science in Kobe. This has nearly 7.3 million ARM cores (take that in!) and weighs in nearly 3 times faster than the previous No.1, Summit, at over 400 petaFLOPS. In fact, the system can already just about reach an exaFLOP in peak (i.e. unrealistic) performance tests. Amazing stuff.