Running large MD simulations on ARCHER

In a recent blog post titled "MDFoam on Archer", Saif announced that our team has migrated the MDFoam code from OpenFOAM-1.7 to OpenFOAM-2.1. The team also ran a few simulations to test the performance of our MD code on ARCHER and found that the code scales well, which is happy news for everyone who, like me, wants to run large simulations on ARCHER.

Here I would like to share what I learned while trying to run some of my large simulations on ARCHER, each of which takes more than a week to complete. It is important to know that ARCHER has a runtime limit of 24 hours on the regular queue (48 hours on the long queue), which means that any simulation that needs longer than that will be terminated by the job scheduler (ARCHER uses PBS for job scheduling). In that case we need to restart the simulation from the last time step written before it was stopped. If you only have one or two simulations that need to run for a week, one option is to resubmit the restart job by hand every day or two until the simulation finishes. However, if you have several simulations that run for weeks, resubmitting the restart jobs manually becomes tedious. One solution is to automate the job submission and use checkpointing to save the current state of a simulation when it is terminated by the scheduler. Below is the PBS script that I used to test automated restart job submission. Please feel free to adapt the script to your needs, but I highly recommend testing it on small simulations before using it for large ones, to make sure that you get what you expect.

Two other points are important when using this script: i) enable checkpointing by setting startFrom latestTime; in your controlDict file. If checkpointing is not enabled, the MD simulation will always restart from the same time step (time 0 if startTime 0 is used) and will never finish. ii) load the leave_time utility by running the command module load leave_time. To conclude, this script can be used on any HPC system as long as you have a utility such as GNU's timeout to use in place of the leave_time command to kill the mdPolyFoam application before the runtime limit.
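As a rough illustration of the two points above on a system without leave_time, here is a minimal sketch built around GNU timeout. It is not a tested recipe: the helper names walltime_to_seconds and restart_enabled are hypothetical, the check assumes the standard OpenFOAM entry startFrom latestTime; in controlDict, and the mpirun launcher line is only a guess at what another site might use, so it is left commented out.

```shell
#!/bin/sh

# Convert an HH:MM:SS walltime string into seconds.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed only if the given controlDict restarts from the latest saved time.
restart_enabled() {
    grep -Eq '^[[:space:]]*startFrom[[:space:]]+latestTime[[:space:]]*;' "$1"
}

walltime="24:00:00"   # should match the "#PBS -l walltime" request
margin=30             # seconds kept in reserve for the restart logic
limit=$(( $(walltime_to_seconds "$walltime") - $margin ))
echo "run limit: ${limit} s"

# Refuse to start if a restart would begin from time 0 again.
if [ -f system/controlDict ] && ! restart_enabled system/controlDict; then
    echo "warning: startFrom latestTime; not found in system/controlDict"
fi

# Assumed launcher line for a non-Cray system (adjust to your site):
# timeout "$limit" mpirun -np 48 mdPolyFoam -parallel
```

Deriving the limit from the walltime string keeps the timeout value in step with the scheduler request if you later move to a queue with a different limit.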

########## Beginning of the script ##############
#!/bin/bash --login
#PBS -N Test
#PBS -l select=2
#PBS -l walltime=24:00:00
#PBS -A d70
#PBS -j oe
#PBS -V

module load leave_time

# resolve symbolic links in the submission directory path
export PBS_O_WORKDIR=$(readlink -f "$PBS_O_WORKDIR")
# change to the directory that the job was submitted from
cd "$PBS_O_WORKDIR"

# run mdPolyFoam using 48 cores
# leave_time is an ARCHER utility (it uses GNU timeout)
# it kills mdPolyFoam 30 s before the walltime limit (at 23h:59m:30s for a 24 h job)
leave_time 30 aprun -n 48 mdPolyFoam -parallel

# restart simulation if not completed
# also backup any data that is written by MDFoam
# please note that I use the file "processor0/timings/cpuTimeProcess_evolve_average.xy"
# to get the current time,
# so make sure this file exists for your case, or else use another file to get the current time
finishTime=$(grep -m 1 '^endTime' system/controlDict | awk '{print $2}' | rev | cut -c2- | rev)
currentTime=$(awk 'END{print $1}' processor0/timings/cpuTimeProcess_evolve_average.xy)

# note: -ne is an integer comparison, so both times must be whole numbers
if [[ "$finishTime" -ne "$currentTime" ]]; then
    # backup/copy information (Note: use backup only if needed)
    mv processor0/fieldMeasurements "processor0/fieldMeasurements-$currentTime"
    # after backup submit restart job
    qsub submit.pbs
else
    echo "Simulation is completed!"
fi

########## End of the script ##############
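
The -ne test in the script compares integers, so it assumes that endTime and the time read from the timings file are both whole numbers. If your case uses decimal or exponent notation (for example 2.5 or 1e2), a numeric comparison through awk is one possible alternative. This is only a sketch: the function name is_finished is hypothetical, and the example values stand in for those the script reads from controlDict and the timings file.

```shell
#!/bin/sh

# Return success (0) when the current time has reached the finish time.
# awk converts both arguments to floating point, so "100.0" and "1e2" compare equal.
is_finished() {
    awk -v cur="$1" -v end="$2" 'BEGIN { exit !(cur + 0 >= end + 0) }'
}

# Example values standing in for those read from the case files
finishTime="1e2"
currentTime="100.0"

if is_finished "$currentTime" "$finishTime"; then
    echo "Simulation is completed!"
else
    echo "Not finished: the real script would call qsub submit.pbs here"
fi
```

Using >= rather than an equality test also stops the resubmission loop if the last written time slightly overshoots endTime.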