… or, "Why object oriented C++ is sometimes considered a bad choice for scientific programming and you need to be careful when using OpenFOAM".
Following on from my previous blog entry (Determinism(?), 16.01.2015), I am still embarking on a quest to improve the performance of the MicroNanoFlow group MD code (MNF-MD code). I have been focussing my efforts on baseline optimisation as if the basic serial performance of the code is well optimised then more complex parallel implementations will be building on good foundations.
There are two branches to consider to code optimisation. The first is the general overhead of the programming style of your code (i.e. how object orientation is used, the choice of data structures and how well memory is managed etc.) and the second is the overhead of the algorithms themselves. As I am approaching this code from a programmers perspective the most obvious first port of call is the general programming overhead.
One way to find hotspots and performance bottlenecks is to profile using a simple and relatively small test case, run for a reasonable number of iterations. The use of a small case reduces the impact of the algorithmic calculations and allows the underlying coding overhead to be revealed. So with this in mind, I ran the MNF-MD code through GNU's GPROF profiling application (after recompiling with -pg) and got some interesting results!
Looking at the flat profile:
What does this tell us? Well, firstly we can see that GPROF is struggling a little with the complex object-oriented (OO) nature of an OpenFOAM application… but that aside we can see that lots of our time is spent calling destructors, we can also see that there are many calls to the function eraseHead(). This is all very interesting as it points to the fact that lots of objects are being destroyed (and therefore created) throughout the life of the application, this is also pointed to by the large number of calls to eraseHead() as this is an internal function used within the OpenFOAM dynamic data structures during a resizing operation (be that creation, destruction or something else).
This is an interesting (and worrying) story for the performance of the software as, ideally, an MD code should create its major data structures once and then manipulate that memory during the life of the simulation, finally destroying them on exit. This is true because most MD simulations are (more or less) fixed in size based on initial input parameters, sure there may be cases of molecules being inserted or deleted but typically the size of the holding data structures won't change that often.
On closer inspection the reality is that a single main data structure is created in the form of a "polyCloud" (this is a doubly-linked list if anybody is interested) and from this multiple dynamic structures are cleared and then re-populated, often at each time-step, to provide an easily accesible list of neighbouring molecules. On the surface of things this is a sensible approach, certainly it follows the examples set by existing OpenFOAM code, but as any high-performance scientific programmer will know, the act of continuously clearing and then resizing dynamic structures is something to be avoided if possible, and here's why:
When memory is allocated in a typical C application this is done using "memset", I won't go into the detail of memset as most will be familiar however the basic idea is to provide a pointer to store the starting memory location of the block being allocated (which also defines the primitive type) and a size in bytes. This provides a contiguous block of memory (i.e. one without any gaps) which (again without going into too much detail) is typically beneficial to application performance. The C++ equivalent would be a call to "new", which is effectively the same thing as calling "memset".
However, this is where the object-oriented approach can sometimes mislead and result in lower performance applications than traditional C (or Fortran) code. When a block of memory is allocated in a single "memset" or "new" call it is contiguous, so (for example) we might allocate a number of arrays equal in size to the number of molecules in the simulation to store their various properties (i.e. position, velocity). However, in the object-oriented OpenFOAM equivalent this might take the form of a container data structure (say a List) which is then populated by as many Molecule objects as there are molecules in the simulation by way of iterating as many times we there are molecules and using the "append" function provided by the data structure.
This is great you are probably thinking, it means a) the data structure resizes itself as needed and b) each element is a compound object of type Molecule that contains all of the data we need in one place and you'd be right… except for the following: Every time "append" is called a number of operations need to happen to resize the block (which in cases where contiguity is to be preserved would involve creating a totally new data structure and copying everything from one to the other before the old one is deleted) and every single element in the structure is created using "new" (as they are objects) meaning that the actual data for each of the Molecule objects could be anywhere in memory, there is no gurantee that consecutive calls to "new" will provide the next block of memory from the end of the last meaning we end up with our Molecule objects fragmented throughout the memory address space.
So OK, we can see that from a memory layout perspective the OO approach may not be ideal for high-performance applications, but there is a level of flexibility that it provides, plus with careful programming we can improve this anyway (i.e. OpenFOAM provides a set of FixedList structure for cases where the size of the data structure is known beforehand), however there is also a non-trivial level of computational overhead associated with creating an object compared to something simpler (say a struct) or of course simply using a primitive type. Now if this overhead were only encountered once at Molecule initialisation then it could be overlooked, however we can see from our profile that in fact there is a significant amount of object destruction (and therefore instantiation) going on here, especially of List objects. This is bad, it means that the act of memory allocation and deallocation is actually a dominant overhead, and in a simulation software that effectively relies on a fixed problem size, this cannot be optimal.
And so onto the crux of the problem and why programming in the OpenFOAM style has to be done so carefully. I mentioned that a single data structure is created as part of the polyCloud once at startup and then multiple dynamic structures are repeatedly cleared and repopulated from this. Well this in itself is a large contributing factor to the overhead seen in the profile, many of those calls to "eraseHead" stem from this.
However, there is also another issue to consider which perhaps doesn't revel itself in the GPROF flat profile, as part of the process of populating some of these structures we see something like "append(molecule.clone())", this little innocuous call has great implications, what it actually means is that, rather than a pointer being stored to the existing Molecule object from our polyCloud structure, a totally new copy is created (including all of the instantiation overhead associated with that), this is then used for the lfietime of the structure (i.e. until the next time "clear()" is called) at which point it has to be destroyed and so the cycle continues. This is the essence of the danger of high-performance OO programming, it becomes second nature to create and destroy objects while forgetting how much overhead is actually associated with that (not to mention the fact that associated memory with the application is constantly being scattered around, destroying potential speed-up benefits of CPU cache coherency etc.).
There are no real conclusions to this post other than to say the current data structures in the MFN-MD code are being reconsidered. The ideal will be a situation where only a single instance of a Molecule object is ever created and as many data structures are fixed in size for the lifetime of the simulation as possible. Clearly not the easiest of tasks but I'll report back next time and provide some before and after comparisons!