I recently began the task of working out why a piece of MD code ran slower for the same problem than an almost identical piece of code. Clearly the issue was down to compiler optimisation being circumvented in the slower code, but it was surprising quite how much of an effect this had on performance.
Like any good coder, the first thing I tried was the easy option, would a later version of the same compiler have better optimisation methods? The slow down was initially seen using GCC 4.4, used for historical reasons, so I re-compiled using GCC 4.8. Annoyingly the performance difference remained, even more annoyingly the simulation results were now completely different than with 4.4, overall sytem energy was similar but every single molecule had a different position and different associated properties after not many time-steps.
This unexpected result led me down a rabbit hole, forcing questions to be answered regarding the nature of determinism in software and the best way to ensure repeatable results.
It isn't unusual for different compiler versions to produce different floating point results, particularly if using unsafe optimisations or different hardware, but given I had compiled both of these applications with the same flags (even down to specifiying the architecture as "core2", the highest available in GCC 4.4), no unsafe optimisations were in place and everything was running in serial, I found the whole thing worrying.
You may now be thinking "but the overall system energy is the same, MD simulations are typically inherently random anyway, what's the issue here?", well, and this may be personal, I believe that a computer simulation should be controllably deterministic. By this I mean, if I write a scientific paper and specify the computing architecture and version of the software used, along with the input parameters, then the reader should be able to repeat my experiment exactly. Arguably, this could be extended to say that the compiler and specific CPU used could be defined and if the reader uses anything other than this then they can expect different results, but with my computer scientist hat on this does not feel like good, robust simulation code. It is unlikely that everybody will have access to exactly the same compiler or CPU type (though their architecture may be same, i.e. x86), it is reasonable to be able to define the compiler to be GCC, but not to specify that it must be GCC 4.4.7 for things to work as expected.
So what has changed between GCC 4.4 and GCC 4.8 to produce such different code? To cut a very long story short, I am not yet at a final answer. What I can say is that differences still appear if I compile at any optimisation level (i.e. -O1, -O2, -O3) and even without optimisation, however, GCC has an interesting compilation option:
-ffloat-store Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory. This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a "double" is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
When this is used with GCC 4.8 at any optimisation level, the code produces identical results to that produced with GCC 4.4 without the same flag. Clearly GCC 4.8 introduces a slightly different algorithm somewhere that either uses or no longer uses the excess precision available in the FPU. Exactly which of the new options does this has yet to be determined but it raises an interesting question, given the results produced by GCC 4.8 are probably more accurate (though this cannot be guranteed), yet many users of the code in question are still using GCC 4.4 as this is what the previous version required, is it better to reduce the quality of the output compiled with later GCC versions or to make the user base aware and suggest they upgrade their compiler? I'll let you know if I come to a conclusion!
To wrap up this post, another determinsim related issue has cropped up during my investigations, how best to use random values in scientific simulations.
True randomness does not really exist in standard computations (we'll ignore quantum computers for now, producing truly random values is a scientific discipline of its own), when you call "rand(seed)" in your software, you are producing a deterministic set of pseudo-random values based on the input seed. Often it seems sensible to seed this using a non-specific value such as the time so that a random effect is apparant to the user as the sequence will change depending on when they run the software. The problem with this when it comes to scientific software boils down to repeatability. Results published in a paper should be repeatable by the reader. The only sensible thing therefore is to define the seed used and publish this along with the results.
Finally, a word of warning for those developing with OpenFOAM, a class called Random is provided as a primitive. If you want to produce random values it seems sensible to use this. However, it is very important to understand how this class actually works. It is simply a wrapper around system level random functions, so when you instantiate a copy in your code and provide it with a seed, it is actually calling the standard system level seed function, meaning ALL instances of Random are effectively re-seeded by the last instance to be created.
This won't be an issue if you create an instance, use it immediately and then let it die or lay dormant, however if you create one instance of Random as part of one object and then again as part of another but then use it from the FIRST object created, the pseudo-random sequence will based on the seed you passed when creating the SECOND object.
It is also worth noting that, by default, the Random class uses the oldest and (arguably) worst of the inbuilt POSIX random number generators unless you compile with the flag "-DUSE_RANDOM", in which case it uses the best, for reference, there are three available by default on most systems:
- rand48() family of methods: These functions generate pseudo-random numbers using the linear congruential algorithm and 48-bit integer arithmetic (this is used by OpenFOAM by default).
- rand(): This function returns a pseudo-random integer in the range 0 to RAND_MAX inclusive (i.e., the mathematical range [0, RAND_MAX]).
- random(): The random() function uses a nonlinear additive feedback random number generator employing a default table of size 31 long integers to return successive pseudo-random numbers in the range from 0 to RAND_MAX. The period of this random number generator is very large, approximately 16 * ((2^31) – 1) (this is used by OpenFOAM if compiled with -DUSE_RANDOM).
So work continues to make sure the code in question finds a good balance between reliable determinism and performance!