OpenFOAM's I/O Problem (and solution)
Posted by Dr Stephen M. Longshaw
Much of the MNF group's research output has been based around our solvers (mdFoam+ and dsmcFoam+), which are written in the OpenFOAM software framework. OpenFOAM is well known and well acknowledged as a very flexible and stable environment in which to develop new solvers. However, it has a bit of a reputation for scaling badly on big supercomputers, leaving people to assume it should only be used when your problem can be tackled by a stand-alone workstation or by only a few nodes on your favourite big HPC system. This blog post will talk about the new collated file format introduced in OpenFOAM 5.0 and how it might be the beginning of the end for this mentality.
The question is, where has this perception come from and, more importantly, is it right? If you search for the issue of OpenFOAM scalability on HPC then you will find numerous articles and topics. What is interesting, though, is (a) how few of them look at massive scalability (most consider running on a few CPUs) and (b) how few of them are recent.
The question therefore is whether OpenFOAM actually does perform badly on HPC systems or whether this is an out-of-date perception. This is a hard one to answer fully, as OpenFOAM has been around for a good few decades and has a number of different solvers to consider. In theory, each should parallelise as well as the others, as they are all built on top of the same basic libraries. Of course, some algorithms work better in parallel than others, and some of the solvers may not have received the same attention as others. Generally speaking, though, the methods used in OpenFOAM are sound: it employs typical static domain-decomposed, non-blocking MPI in most of its solvers and allows well-known decomposition libraries such as Scotch to be used to minimise communication overhead. Undoubtedly this could all be optimised further if it were to receive lots of attention from the HPC community, but are there any other problems blocking this?
The MNF group runs many of its simulations on Archer, the UK's national HPC service, a Cray XC30 machine run by the EPCC. At the moment they provide access to OpenFOAM 4 on their system. Arguably OpenFOAM has a bad reputation for use on this system, but the same problem is repeated on many systems, especially those that use a Lustre parallel file system. That problem is the way that OpenFOAM creates and deals with its files.
For every MPI process created, a new folder is also created, along with its own set of files. In cases where a simulation produces lots of output, this can easily mean there are thousands of files per process on disk. Archer imposes a hard per-user limit on the number of files that can be created in its storage, and also on the number that can be open at any one time. Parallel runs using OpenFOAM quickly exceed these limits and, if the limits weren't there, could have a major impact on the parallel file system for other users. As a result of this, OpenFOAM has developed a bad reputation. It is worth noting that this approach is an entirely valid, if outdated, way of dealing with I/O when using MPI.
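To get a feel for how quickly one-folder-per-process output grows, here is a rough back-of-the-envelope sketch in Python. It is not taken from OpenFOAM itself: the function names, the simple "write times multiplied by fields" model, and every number below (including the quota) are illustrative assumptions, not Archer's actual limits.

```python
def files_per_rank(write_times, fields_per_write, mesh_files=0):
    """Rough number of files in one process folder (simplified model)."""
    return write_times * fields_per_write + mesh_files


def total_files(ranks, write_times, fields_per_write, mesh_files=0):
    """Total files on disk when every MPI rank writes its own folder."""
    return ranks * files_per_rank(write_times, fields_per_write, mesh_files)


def exceeds_quota(file_count, quota):
    """True if the run would break a per-user file-count limit."""
    return file_count > quota


if __name__ == "__main__":
    # Illustrative values only: 1024 ranks, 200 write times, 5 fields each.
    ranks, write_times, fields = 1024, 200, 5
    hypothetical_quota = 500_000  # assumed limit, not a real Archer figure

    count = total_files(ranks, write_times, fields)
    print(f"files per rank: {files_per_rank(write_times, fields)}")
    print(f"total files:    {count}")
    print(f"over quota:     {exceeds_quota(count, hypothetical_quota)}")
```

The point is simply that the total scales linearly with the number of MPI processes, so a case that is harmless on a workstation can produce around a million files on a few dozen nodes.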
The good news is that, as of OpenFOAM 5.0, this has changed: there is now a new way of writing files to disk, known as the collated file format. The idea is simple. Rather than each MPI process creating its own folder, there is now just one set of files written by the master process, and all other processes transfer their data back to it via MPI. If you get hold of the latest development version via the OpenFOAM-dev repository, this has been developed further so that you can mark individual MPI processes as "master node" writers. This spreads the load and reduces communication overhead, as processes then only need to talk to each other within the same node. Therefore, if you were running on 48 nodes of Archer, you would have 1152 MPI processes with 24 on each node, so you would have 48 sets of files instead of 1152. This is really quite significant: if you assume there are 1000 files per set by the end of a simulation, then you have 48,000 files rather than 1,152,000!
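The arithmetic above can be reproduced with a short Python sketch. The scheme labels ("uncollated", "collated", "per_node") and the function names are just names chosen for this illustration rather than OpenFOAM settings, and the 1000 files per set is the same assumption made in the text.

```python
def file_sets(nodes, ranks_per_node, scheme):
    """Number of separate sets of files written for a given output scheme."""
    if scheme == "uncollated":   # every MPI process writes its own folder
        return nodes * ranks_per_node
    if scheme == "collated":     # a single master process writes everything
        return 1
    if scheme == "per_node":     # one "master node" writer on each node
        return nodes
    raise ValueError(f"unknown scheme: {scheme}")


def files_on_disk(nodes, ranks_per_node, scheme, files_per_set):
    """Total file count, assuming every set holds the same number of files."""
    return file_sets(nodes, ranks_per_node, scheme) * files_per_set


if __name__ == "__main__":
    nodes, ranks_per_node, files_per_set = 48, 24, 1000
    for scheme in ("uncollated", "collated", "per_node"):
        sets = file_sets(nodes, ranks_per_node, scheme)
        total = files_on_disk(nodes, ranks_per_node, scheme, files_per_set)
        print(f"{scheme:>10}: {sets:5d} sets, {total:9,d} files")
```

With one writer per node, the file count depends on the number of nodes rather than the number of MPI processes, which is a factor of 24 fewer files in this example.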
We have done some basic testing and have found the new file format to be about 50% faster on Archer, using the flow past a motorbike tutorial case with simpleFoam on 48 nodes.
Of course, the really exciting thing about this development is that the HPC community can now really get stuck in to the challenge of properly benchmarking OpenFOAM over many more MPI ranks than has previously been attempted, as file output no longer stands in the way of running large cases. This will hopefully lead to rapid development of the underlying MPI approach and only serve to increase the performance of OpenFOAM across all of its solvers, including the MNF group codes!