Day 4 of IPCC-Colfax work at EPCC
Posted: 12 Jun 2015 | 15:41
MPI and vectorisation: Two ends of the optimisation spectrum
Day four of this week of intensive work optimising codes for Xeon Phi saw a range of activity. The majority of the effort focussed on the vectorisation performance of CP2K and GS2: looking at the low-level details of the computationally intensive parts of these codes, checking whether the compiler is producing vectorised code and, if not, whether anything can be done to make it vectorise.
All this optimisation work has been driven by profiling and analysis using a range of different tools. As we are aiming for good performance on the Xeon Phi we have primarily been using Intel's VTune profiler, which provides a wide range of detail about a code's computational and memory performance on Xeon and Xeon Phi processors. The screenshot below shows an example of the profile information from the OpenMP version of CP2K.
VTune profile of CP2K OpenMP code
We have also used Allinea's MAP profiler, as it too can analyse code performance on Xeon and Xeon Phi processors, and it is often useful to have the output of two profilers so we can compare and contrast the results. Indeed, we hope to shortly write up a description of this process for the IPCC web page on the EPCC website, detailing how to profile codes in the different profilers and compare the results. Intel's Vector Advisor tool has also been useful in understanding vector performance and problems. Vector Advisor does not work with Xeon Phi processors, but it does help us understand which loops inside a program vectorise well and where there are problems that hamper vectorisation.
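To illustrate the kind of obstacle these tools flag, here is a hedged, hypothetical C sketch (not code from CP2K or GS2, which are Fortran): without help, the compiler must assume the output array could overlap the inputs, which forces it to assume a loop-carried dependence and can prevent vectorisation.

```c
/* Hypothetical loop of the sort a vectorisation report flags: if the
 * compiler cannot prove that 'out' never overlaps 'a' or 'b', it must
 * assume a dependence between iterations and may refuse to vectorise.
 * Declaring the pointers 'restrict' asserts the arrays do not alias,
 * removing that obstacle. */
static void axpy(int n, const double *restrict a,
                 const double *restrict b, double *restrict out)
{
    for (int i = 0; i < n; i++)
        out[i] = 2.0 * a[i] + b[i];
}
```

In Fortran the aliasing rules are stricter, so the analogous problems tend to come from assumed-shape arrays or indirect addressing instead, but the diagnostic workflow with the tools is the same.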
EPCC has previously undertaken significant work attempting to improve the vectorisation performance of CP2K, documented in this report. However, it is always good to have another set of experts look at a problem: a fresh pair of eyes and a different approach can often find problems and opportunities previously overlooked. So far we have not found any major new code changes or issues that could improve the vectorisation of CP2K. One of the challenges of the core computational kernels in CP2K is that they often iterate over small loops (tens of iterations). It is usually possible to vectorise these loops, but doing so does not provide large performance benefits. One approach we have considered in the past is to implement vectorisation at a higher level in the code, i.e. to vectorise over the outer loops of the computational kernels rather than the inner loops. This involves some restructuring of the code, so we have not yet investigated the optimisation properly, but the work we have done this week on CP2K vectorisation confirms that it is probably the only realistic route to better vector performance, so we will focus on it over the coming weeks.
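The outer-loop idea can be sketched as follows. This is a hypothetical C kernel shape, not CP2K's actual code: a short inner loop (here four iterations) is too small to vectorise profitably, so we ask the compiler to vectorise the long outer loop instead.

```c
#define NCOEF 4   /* short inner trip count, too small to vectorise well */

/* Hypothetical kernel shape: evaluate a short polynomial at many points.
 * Vectorising the 4-iteration inner loop gains little; vectorising over
 * the long outer loop (one point per vector lane) uses full-width
 * vectors. The 'omp simd' pragma is a portable way to request this. */
static void eval_poly(int n, const double *restrict x,
                      const double *restrict coef, double *restrict y)
{
    #pragma omp simd            /* vectorise over points, not coefficients */
    for (int i = 0; i < n; i++) {
        double acc = 0.0;
        for (int j = 0; j < NCOEF; j++)   /* short inner loop stays scalar */
            acc = acc * x[i] + coef[j];
        y[i] = acc;
    }
}
```

In real code the restructuring is rarely this clean: the outer loop often carries dependences or indirect accesses that have to be untangled first, which is why this is a multi-week task rather than a pragma.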
Finally, we have also been investigating the MPI performance of CP2K, using both CrayPat (Cray's profiling tool) and ITAC (Intel's Trace Analyzer and Collector). In general CP2K scales well using MPI communications, but there are a couple of places where performance could be improved. We identified that a lot of time was being spent waiting for messages, and also waiting for the collective operations that create MPI communicators, as shown in the following pictures.
Picture of the ITAC profiler timeline for MPI messages in a CP2K run
The MPI part of CP2K seems to spend quite a lot of time waiting for messages to arrive, or waiting on collective operations (particularly the MPI_Comm_create routine). Looking at the operations the code is performing, it should be possible to do this work without waiting on other processes, as the data being sent and received is generally read-only. It looks as though processes are effectively locking sections of matrices whilst they work on them, even though these matrices should be read-only for this part of the operation. However, we have only been looking at this operation for a short time, so it is quite possible that there are things going on here we have not fully appreciated, and that there are good reasons for these waits. Still, if we can remove the waits we have a chance to significantly improve the performance of the code, as the wait routines are the highest item in our profile for the particular core count and benchmark used here. The improvement would be even more significant on the Xeon Phi, which generally has slower MPI performance than the host and requires more MPI processes.
ITAC profile showing processes waiting for MPI_Comm_create to finish
We have also been looking at the vectorisation performance of GS2. The profiler and Vector Advisor report that a number of loops account for a large amount of the code's run time, but also that these loops tend to be memory bound rather than compute bound (waiting on data to be loaded from memory rather than waiting for the processor to execute instructions). The core computational loops are vectorising well and there is no sign of significant problems here, although, as with CP2K, the loops are quite small, so there may be scope for collapsing them with the outer loop to give a larger iteration count to vectorise over. Another suggestion has been that there may be merit in splitting a large loop body into smaller sections, storing the temporary results of each section in aligned vector arrays and completing the calculation by iterating over each of these sections; storing the temporary values in fully aligned arrays and using smaller loop bodies may help vector performance.
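The split-loop suggestion can be sketched in C as follows (a hypothetical kernel, not GS2's actual loops, and the block size and alignment are illustrative assumptions): one long loop body becomes several short loops over a strip of elements, with intermediate results held in small aligned scratch arrays so each short loop vectorises cleanly.

```c
#define BLK 64   /* strip length; temporaries live in small aligned buffers */

/* Hypothetical sketch of the transformation: process BLK elements at a
 * time through a sequence of short loops, passing intermediate results
 * between them in cache-resident, 64-byte-aligned scratch arrays. */
static void strip_mined(int n, const double *restrict a,
                        const double *restrict b, double *restrict out)
{
    double t1[BLK] __attribute__((aligned(64)));
    double t2[BLK] __attribute__((aligned(64)));

    for (int s = 0; s < n; s += BLK) {
        int len = (n - s < BLK) ? n - s : BLK;

        for (int i = 0; i < len; i++)   /* section 1: first sub-expression */
            t1[i] = a[s + i] * a[s + i];
        for (int i = 0; i < len; i++)   /* section 2: second sub-expression */
            t2[i] = b[s + i] + 1.0;
        for (int i = 0; i < len; i++)   /* section 3: combine and store */
            out[s + i] = t1[i] * t2[i];
    }
}
```

The trade-off is extra loads and stores of the temporaries against simpler, fully aligned loop bodies; whether it wins depends on how much of the original body was spilling registers or defeating the vectoriser.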
However, given that these loops appear to be dominated by memory performance rather than compute performance, another valid approach would be to restructure them so that each loop contains only one significant array: iterate over the elements of the array required for that section of the calculation, then move on to the next section of the calculation in a separate loop. There is a chance this may improve cache usage, reduce register pressure on the processor and therefore improve performance.
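This second idea is classic loop fission. A hedged, hypothetical C sketch (again, not GS2's actual code) of the before-and-after shape:

```c
/* Fused form: one loop streams through a, b and c simultaneously,
 * competing for cache and registers. */
static void fused(int n, const double *restrict a,
                  const double *restrict b, double *restrict c)
{
    for (int i = 0; i < n; i++)
        c[i] = 2.0 * a[i] + b[i];
}

/* Fissioned form: each pass touches one significant input array plus the
 * accumulator, so each loop streams a single input through the cache. */
static void fissioned(int n, const double *restrict a,
                      const double *restrict b, double *restrict c)
{
    for (int i = 0; i < n; i++)   /* pass 1: only a is read */
        c[i] = 2.0 * a[i];
    for (int i = 0; i < n; i++)   /* pass 2: only b is read */
        c[i] += b[i];
}
```

Fission is not free (the output array c is now traversed twice), so it only pays off when the fused loop is genuinely memory or register bound; that is exactly what the planned tests should tell us.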
We are going to undertake a couple of small tests on GS2 using both of these approaches to see whether they make a performance difference for the code, and take development from there.