Parallelism… to infinity, and beyond!

I'd like to take this opportunity to pose a few questions, along with my own thoughts, on where parallel computing is heading and what it means not just for Molecular Dynamics codes like the one the MicroNanoFlow group have implemented within OpenFOAM (available publicly soon), but for the way scientists using computational resources are going to need to think in order to keep up as we switch to the age of massively parallel design.

The current High Performance Computing (HPC) landscape for the academic researcher is always changing, but until recently it is fair to say the scale of the changes has been relatively small. Typically, supercomputing resources get bigger, providing more cores that are faster (or more efficient) and offering bigger/faster pools of memory. There have also always been new and interesting "accelerator" add-ins that benefit specific computational tasks; good examples are NVIDIA's line of Tesla GPUs or, more recently, Intel's Xeon Phi add-in cards. Historically though, the interesting thing with accelerator add-ins (I'm looking at you, separate hardware floating point unit…) is that when they become widely adopted and deemed useful enough, the big CPU architectures tend to absorb them or adopt the concept somehow.

To concentrate on GPUs for a moment, for those unfamiliar with their history, the idea of a dedicated graphics processing chip is not new; it has been around for a long time, and ultimately the job of such a device is quite simple: to process simple sets of data as quickly as possible and translate them into pixels that can be viewed on a screen. The best way to do this quickly is to work in as parallel a manner as possible and to store the data in a pool of memory that is as fast as possible, connected to the processor by as big a data bus as possible. This basic premise led to the evolution of GPUs that effectively contain hundreds of individual processors, each with low computational capability on its own, but which as a whole (when combined with expensive but fast memory and fat data buses) offer impressive computing potential in a small package.

Originally, the only way to "program" these processors was via the graphics libraries they were designed to handle (OpenGL, DirectX etc.). However, when a few clever people at NVIDIA HQ saw an alternative way forward and released the CUDA suite, these GPUs became an accessible tool for many more people, not least scientists solving interesting problems. Of course, many codes have not adopted GPU acceleration, either because their problem was not suited to the architecture or because the effort required wasn't deemed acceptable by the developers. However, as things continue to progress it is looking more and more likely that traditional methods of parallelising problems will have to be reconsidered, and the often more time-consuming methods such as OpenMP given more thought.

Since CUDA's release in around 2007 the GPU market has evolved rapidly, as have CUDA and other GPU programming methods. We are now at the stage where top-end GPU devices such as the Tesla K80 offer well over 4000 "cores" and 12GB of extremely fast memory in a single package consuming less than 300W. Of course this has sparked competition, and the likes of Intel have seen the potential in this different way of looking at processor design, which brings us to the Xeon Phi range.

These cards offer a smaller number of "fatter" cores than a GPU (currently 60, each of which can handle up to 4 threads), but on paper both have similar processing potential. There are also a number of new programming paradigms being fuelled by these architectures: traditional methods such as OpenMP are being forced to evolve rapidly at the hand of Intel so they become more suitable for the likes of the Phi, while other efforts such as PGI/NVIDIA's OpenACC are quickly turning into useful standards with high quality associated libraries and tools. OpenACC is effectively an OpenMP-like alternative that aims to target multiple platform types, from multi-core CPUs through to GPUs. Also of interest are the IBM Power8 and upcoming Power9 architectures.
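To give a flavour of the directive style OpenACC takes, here is a minimal sketch in C++ (the loop and the names in it are invented purely for illustration, not taken from any real solver); a single pragma asks a capable compiler to parallelise the loop for whichever target it was built for:

    #include <vector>

    // Built with an OpenACC-capable compiler (e.g. PGI's pgc++ with -acc) the
    // pragma below is honoured; without that support it is simply ignored and
    // the loop runs serially.
    void scaleAndAdd(std::vector<float>& y, const std::vector<float>& x, float a)
    {
        const int n = static_cast<int>(y.size());
        float* yp = y.data();
        const float* xp = x.data();

        // One directive covers both the parallel execution and, on a GPU
        // target, the movement of the arrays to and from device memory.
        #pragma acc parallel loop copy(yp[0:n]) copyin(xp[0:n])
        for (int i = 0; i < n; ++i)
        {
            yp[i] = a * xp[i] + yp[i];
        }
    }

The appeal is exactly as described above: the same annotated source can be compiled for a multi-core CPU or offloaded to an accelerator, with the compiler and runtime doing the heavy lifting.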

The interesting thing in all of this is a clear trend away from the traditional design of a small number of very fat CPU cores in a single package (i.e. the current Intel Xeon and AMD offerings) towards finer-grained parallelism within the same package. The clearest case in point is that the next version of Intel's Xeon Phi, named Knights Landing (https://software.intel.com/sites/default/files/managed/e9/b5/Knights-Corner-is-your-path-to-Knights-Landing.pdf), will be a fully system-on-chip design instead of an add-in card. Effectively, instead of a machine having a typical Intel Xeon x86 CPU with 8 or so very powerful cores, it will have access to 60-100 cores that are individually much weaker but more powerful when combined.

Software that has previously been written to work across multiple shared-memory processors using something like OpenMP will automatically work on this new style of Intel CPU, but it is likely many codes will see a decrease in performance unless time is then spent refining them to take advantage of vectorised operations as well as ensuring they actually scale in parallel terms. Codes which use only MPI for their parallelisation may struggle even more…
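To make the vectorisation point concrete, here is a hedged sketch of the kind of refinement involved (the function and variable names are invented for illustration). A plain parallel-for already spreads the loop over threads; the added simd clause (OpenMP 4.0 onwards) also asks each thread to use its core's vector units, which matters far more on many weak, wide-vector cores than on a handful of fat ones:

    #include <cstddef>

    // Thread-level parallelism alone is no longer enough on many-core parts;
    // the 'simd' clause requests vectorisation within each thread as well.
    void accumulateForces(double* force, const double* contribution, std::size_t n)
    {
        #pragma omp parallel for simd
        for (std::size_t i = 0; i < n; ++i)
        {
            force[i] += contribution[i];
        }
    }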

So what does this mean for future software design? First and foremost it means all future scientific software should be designed from the ground up to exploit parallel execution. No ifs, no buts: serial processing is no longer an option. That aside, there is an ever-growing number of ways to keep thinking serially when designing an algorithm and let other software handle the transition. These range from automatic methods (such as the auto-vectorisation employed by compilers) through to domain-specific languages that hide all of the complexity of producing code optimised for any one hardware type (though somebody still has to do that work at some point…). For most, though, the best methods are the ones somewhere in the middle: the likes of OpenACC or OpenMP.

Using these we can still program in normal languages but take advantage of future architectures by way of simple "pragmas", or directives, placed in the code. We can also use these methods to help accelerate existing codes on the new architectures. The nice thing is that, as long as everything we do is standards compliant, the software we build will keep improving as the underlying libraries upgrade and evolve to provide better performance or compatibility with new architectures. The most important factor in these designs is often an understanding of how to lay out memory in a way that suits parallelisation. This is especially important for methods like OpenMP or OpenACC, where a memory structure only suited to serial execution can even lead to a slow-down once parallel execution is attempted. Not all problems have an obvious solution to this; in those cases sticking with multi-process parallelisation methods like MPI may still be the better way forward.
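As a generic illustration of what a "parallel-friendly" memory structure can mean, consider the textbook array-of-structures versus structure-of-arrays trade-off (the types and names below are invented, not a snapshot of any particular code):

    #include <cstddef>
    #include <vector>

    // Array-of-structures: natural for serial, object-style code, but a loop
    // touching only the x components strides through memory and vectorises poorly.
    struct Particle
    {
        double x, y, z;
        double mass;
    };

    // Structure-of-arrays: each field is contiguous, so directive-based
    // parallelisation and the vector units get simple unit-stride access.
    struct ParticleFields
    {
        std::vector<double> x, y, z;
        std::vector<double> mass;
    };

    void shiftX(ParticleFields& p, double dx)
    {
        double* xp = p.x.data();
        const std::size_t n = p.x.size();

        #pragma omp parallel for simd
        for (std::size_t i = 0; i < n; ++i)
        {
            xp[i] += dx;
        }
    }

The same directive applied to the array-of-structures layout would still be correct, but the scattered memory accesses can eat much of the benefit, which is exactly the kind of serial-friendly structure that disappoints once parallel execution is attempted.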

In summary then, the message is becoming increasingly clear. When designing software there is always likely to be a concept of distributed computing, because the amount of processing you can physically locate in an enclosed space will always be limited, so methods like MPI are unlikely to go anywhere (though they may slowly be absorbed into the underlying fabric of the compilers and other parallelisation tools we use, so that we no longer need to worry explicitly about where data lives). However, the level of parallelism we are exposed to within each shared memory resource is increasing, rapidly. Relying on one form of parallelisation such as MPI is becoming less and less sufficient, so a mixed parallelisation strategy that combines techniques such as OpenMP or OpenACC with distributed (i.e. MPI-based) methods is likely the best approach for a future-proof code. That way we can make good use of fine-grained parallelism (be that via the host processor or an accelerator such as a GPU) while still being able to distribute the problem between discrete pools of memory.
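To close, here is a very small sketch of what such a mixed strategy looks like in practice (generic boilerplate around a simple sum-reduction, not a fragment of any real solver): MPI distributes the work between discrete pools of memory, while OpenMP provides the fine-grained parallelism within each of them.

    #include <cstddef>
    #include <cstdio>
    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv)
    {
        // Ask for an MPI library that tolerates the process also running threads.
        int provided = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Coarse-grained parallelism: each rank owns a chunk of the problem
        // in its own, separate pool of memory.
        std::vector<double> local(1000000, 1.0);

        // Fine-grained parallelism: OpenMP threads work on that chunk in
        // shared memory (an accelerator directive could sit here instead).
        double localSum = 0.0;
        #pragma omp parallel for reduction(+:localSum)
        for (std::size_t i = 0; i < local.size(); ++i)
        {
            localSum += local[i];
        }

        // Combine the per-rank results across the distributed memory.
        double globalSum = 0.0;
        MPI_Reduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
        {
            std::printf("global sum = %f\n", globalSum);
        }

        MPI_Finalize();
        return 0;
    }

Compiled with an MPI wrapper and the appropriate OpenMP flag (e.g. mpicxx -fopenmp), the same source can be spread over several nodes while each rank keeps all of its local cores busy.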