Optimising OpenMP implementation of Tinker, a molecular dynamics modelling package
Posted: 3 Jul 2014 | 11:42
Do you use scientific codes in your research? Are the things you can do with it limited by the execution time? The code has been parallelised but does not scale well? How should you go about improving the performance? What can you do when you do not have full understanding of the code? There are some general steps that can be taken to improve the performance of parallelised codes. In this article I will describe briefly the process I have undertaken to optimise the parallel performance of a computational chemistry package, TINKER, as part of the EPCC/SSI APES project.
TINKER can be used to perform molecular modelling and molecular dynamics simulations. Originally written in Fortran 77 and recently ported to Fortran 90, it has already been parallelised for a shared memory environment using OpenMP. The code does not scale well with increasing number of cores, that is to say the performance does not improve as more cores (the processing units in processors – effectively mini processors) are added. The scaling is even poorer on AMD architectures. In my investigation I used INDY, a cluster machine hosted by EPCC consisting of 24 compute nodes, each with four 16-core AMD Opteron 6276 2.3 GHz Interlagos processors, giving a total of 64 cores per node that share memory. As the parallelisation of TINKER is purely for shared memory, all my investigations were restricted to a single node.
On running the code on this system, one quickly finds that although using more than one thread shortens the execution time, the improvement is not proportional to the number of threads used. One usually uses speedup (the ratio between the time taken to execute the code on one thread compared to using multiple threads for the same problem) to measure the efficacy of the parallelisation of the code. The largest speedup obtained in this case was 3.63 observed for 16 threads implying a parallel efficiency of only about ~23%. The corresponding speedups observed for 32 and 64 threads were 3.59 and 3.15, respectively. In this instance, using larger numbers of threads does not equal better performance. Moreover, taking a parallel efficiency of at least 60% as an acceptable minimum as an indicator of efficient use of hardware resources means that for this architecture it would not be worthwhile to run TINKER on more than 4 threads.
If you ever want to optimise a code, your first step in the optimisation process, after base-lining the performance, should focus on the compiler optimisations. Care is required though as changing compiler flags can both improve as well as degrade the performance of a code. Regardless, many scientific applications do not use compilers to their maximal advantage. Compilers have tens of optimisations options, which can be confusing. I usually have a look at the flags controlling the optimisation levels (-O0, -O1, -O2 and -O3) and platform-specific flags. After determining that the compiler settings have been chosen correctly for the given platform, it is good idea to check if the parallel directives are inhibiting the compiler optimisations by compiling the code with and without the OpenMP enabling flag (eg –fopenmp for gfortran) and executing the code on only one thread. For TINKER the parallel version on one thread was slightly faster than the serial execution. This implies that the parallel directives are not inhibiting the compiler from performing optimisations.
After sorting out the compiler settings, it is necessary to understand what is happening in the parallel sections of the code. OpenMP forks a number of threads in parallel sections, enabled through compiler directives, which are used to concurrently execute those parts of the code. This is what, hopefully, makes your code go faster. Just because the code has been parallelised correctly, ie in a way that produces the same results (but faster), it does not mean it has been done in the most efficient way. Therefore, the next step is to profile the parallelised code executed on a different number of threads. The profiling done using the OmpP profiler revealed that the most dominant routines in TINKER have been parallelised using parallel and do directives. It also showed that there are some overheads arising mostly from load imbalance, ie not every thread is given equal amounts of work to do, and thread management, eg the creation and merging of threads, both of which grow larger as the number of threads is increased. However, these parallel overheads are not enough to explain the extent of poor scaling of the code.
To reduce load imbalance between different threads and, at the same time, improve caching (storing data in fast memory close to the processor) and reduce false sharing (false sharing is when cores modify the same variable stored in a local core cache which then has to be communicated to the other core caches – this can severely degrade performance), it is necessary to find which OpenMP schedule is the most suitable for each do region. The schedule informs the compiler how to divide the data amongst the different threads. The code was optimised using various different schedules appropriate for the parallel region of the code being addressed and up to 50% improvement was obtained (when using gfortran). It now scales much better but the scaling is ‘acceptable’ (ie the parallel efficiency is at least 60%) for only up to 8 threads. The optimised code on 8 and 16 threads has a speedup of ~5.3 and ~6.6, respectively. Although the performance has been doubled, it is still not good enough on larger numbers of threads. Note that the performance improvement obtained will also depend on the compiler technology that you are using.
Taking into consideration that the parallel coverage (how much of the code is actually running in a parallel mode) is about 75% on 8 threads, a speedup of 5.3 seems to be a fairly good result. The question is why the corresponding behaviour does not extend to larger numbers of threads. The answer to that question seems to be related to the Non Uniform Memory Access (NUMA) effects. NUMA architectures, which nowadays are common, allow systems to be built with lots of cores that share a common large bank of memory. Accessing this memory though has different costs associated with it, depending on the distance between the core and memory location, that can degrade the performance. The machine used at EPCC is a cc-NUMA architecture – the cc means that it is cache coherent (a cache is put in front of the memory banks to mitigate some of the memory latency of accessing the separate memory banks), with 8 NUMA regions per node (2 cores x 4 modules). To investigate to what degree the NUMA effects affect the performance of the Tinker code, the Unix numactl command was used.
Using 8 threads and placing them in different configurations (one, two, four and eight per NUMA region) showed that the best performance was observed when there was only one thread per NUMA region. There is a drop in performance for two threads per NUMA region but not as significant as for four and eight threads. That means that the best performance is obtained when the sharing of the hardware resources (cache and memory controller) between the threads is kept to a minimum. The difference between one and two threads per NUMA region is not that great - meaning that the memory bandwidth is not saturated yet - however using more than 2 threads per NUMA region clearly causes memory bandwidth contention. Similarly, when 16 threads are used, although only 2 threads are placed in each NUMA region, the memory bandwidth is saturated. So, we believe this is what is causing the major performance bottleneck in TINKER.
There is no easy fix that would improve the performance of Tinker any further without a radical change to the code base. It may be possible to gain some improvement from restructuring all of the data structures to improve data locality. However, there is a limit to how much that would improve the performance and the effort required to do it is big. It seems more feasible to adapt TINKER to use a distributed memory model but that would also require a lot of effort and significant restructuring of the code.
The specifics of the optimisation process are code dependant, therefore it is natural that some of the codes will benefit from the optimisation process more or less than the others. The problem is that scientific applications were (still are?) often written with the focus of ‘getting the science bit right’ and do not give much consideration to the performance issues until much later. This should change.