Day 3 of optimising for the Xeon Phi, moving on to vectorisation

Author: Adrian Jackson
Posted: 11 Jun 2015 | 16:01

Moving from OpenMP to vectorisation and MPI

Reality hit home a bit on the third day of our intensive week working with Colfax to optimise codes for the Xeon Phi.

After further implementation and analysis work it appears that removing the allocation and deallocation calls from some of the low-level routines in CP2K will improve the OpenMP performance on both Xeon and Xeon Phi, but only because there is an issue with the Intel compiler that is causing poor performance in the first place. The optimisation can give a reduction in runtime of around 20-30% for the OpenMP code, but only with versions 15 and 16 of the Intel compiler; with v14 there is a much smaller performance improvement.
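We won't reproduce the CP2K routines here (CP2K is a Fortran code and the affected routines are fairly involved), but a minimal C sketch of the pattern we are applying, with invented routine names and sizes, looks something like the following: instead of a low-level routine allocating and freeing its own scratch space on every call, the caller allocates the buffer once per OpenMP thread and reuses it.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define ROWS 1000
#define COLS 4096

/* "Before": a low-level routine that allocates and frees its own scratch
 * buffer on every call, putting the allocator on the hot path of every
 * OpenMP thread. */
static double work_alloc_each_call(const double *in, int n)
{
   double *scratch = malloc(n * sizeof *scratch);
   double sum = 0.0;
   for (int i = 0; i < n; i++) {
      scratch[i] = in[i] * in[i];
      sum += scratch[i];
   }
   free(scratch);
   return sum;
}

/* "After": the caller supplies the scratch buffer, so each thread can
 * allocate it once and reuse it for every iteration it executes. */
static double work_reuse_buffer(const double *in, int n, double *scratch)
{
   double sum = 0.0;
   for (int i = 0; i < n; i++) {
      scratch[i] = in[i] * in[i];
      sum += scratch[i];
   }
   return sum;
}

int main(void)
{
   static double data[ROWS][COLS];
   double check = 0.0, total = 0.0;

   for (int j = 0; j < ROWS; j++)
      for (int i = 0; i < COLS; i++)
         data[j][i] = (double)(i + j);

   double t0 = omp_get_wtime();
   #pragma omp parallel for reduction(+:check)
   for (int j = 0; j < ROWS; j++)
      check += work_alloc_each_call(data[j], COLS);
   double t1 = omp_get_wtime();

   #pragma omp parallel
   {
      /* One allocation per thread for the whole loop, not one per call. */
      double *scratch = malloc(COLS * sizeof *scratch);

      #pragma omp for reduction(+:total)
      for (int j = 0; j < ROWS; j++)
         total += work_reuse_buffer(data[j], COLS, scratch);

      free(scratch);
   }
   double t2 = omp_get_wtime();

   printf("alloc per call: %.3fs  reused buffer: %.3fs  (sums %g / %g)\n",
          t1 - t0, t2 - t1, check, total);
   return 0;
}

How much this gains clearly depends on the compiler and its OpenMP runtime, as noted above; the sketch just shows the shape of the change, not the numbers we measured.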

However, it is still worth investigating, so we are pressing on with applying the modification to a wider set of the code to see what performance benefits can be achieved. We will also have to test with other compilers (such as the GNU compilers) to check that the code modifications are beneficial to the majority of the code's users.

[Figure: Profiling data on CP2K OpenMP performance from Intel's VTune tool]

Further optimisation of the OpenMP code may not be possible, as the way the code is currently implemented means there are times when some threads are busy working whilst others are idle, or working less often, as shown in the timeline in the figure above. This is primarily down to how the loops in the code are structured, but it is generally not a problem for normal CP2K users, as they don't run with as many OpenMP threads as we are investigating here. The Xeon Phi, with its support for large numbers of threads but its low memory per core, often requires approaches that make maximum use of multithreading, which is why we are pushing the OpenMP performance of CP2K.
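CP2K's loops are considerably more involved than this, but the kind of imbalance visible in the timeline can be reproduced with a toy OpenMP loop whose iterations have very uneven cost; everything in the sketch below is invented for illustration, and the schedule(dynamic) variant is just one of the standard ways of evening the work out.

#include <stdio.h>
#include <omp.h>

#define N 2000

/* Simulated work whose cost grows with the row index, so contiguous blocks
 * of rows carry very different amounts of work. */
static double row_work(int row)
{
   double sum = 0.0;
   for (int i = 0; i < row * 50; i++)
      sum += 1.0 / (1.0 + i + row);
   return sum;
}

int main(void)
{
   double total = 0.0;

   /* Static schedule: each thread gets one contiguous block of rows, so the
    * threads holding the early (cheap) rows sit idle while the last thread
    * is still working -- the gaps seen in a profiler timeline. */
   double t0 = omp_get_wtime();
   #pragma omp parallel for schedule(static) reduction(+:total)
   for (int row = 0; row < N; row++)
      total += row_work(row);
   double t1 = omp_get_wtime();

   /* Dynamic schedule: rows are handed out in small chunks as threads become
    * free, which evens the load at the cost of some scheduling overhead. */
   #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
   for (int row = 0; row < N; row++)
      total += row_work(row);
   double t2 = omp_get_wtime();

   printf("static: %.3fs  dynamic: %.3fs  (total %g)\n",
          t1 - t0, t2 - t1, total);
   return 0;
}

With the static schedule the threads given the later (more expensive) rows finish long after the others; whether a different schedule helps in CP2K depends on how its loops are restructured, which is exactly the part that may not be worth the effort.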

Tomorrow we will go on to look at the vectorisation performance of CP2K (if it does not vectorise well then it won't be possible to get good performance on the Xeon Phi at all) and at the MPI communication pattern, to see if there is any scope for optimisation there. We may also look at another code, the plasma simulation code GS2.

Whilst helping one of our MSc students with porting a code to the Xeon Phi we also came across another issue we hadn't seen before. The Xeon Phi requires data to be element-wise aligned (more information on this can be found here). What this means is that if you are accessing double precision reals (i.e. 64-bit, 8-byte data structures), they need to be allocated and accessed through addresses that are a multiple of 8. Likewise, if you are allocating and accessing integers (32-bit, or 4-byte, data structures), they need to sit at an address that is a multiple of 4.
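A quick way to see whether a particular pointer meets the requirement is to look at its address modulo the element size; the short sketch below (not taken from the library in question) prints that remainder for a malloc'd buffer and for the same buffer offset by one int.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main(void)
{
   int *arr = malloc(100 * sizeof *arr);

   /* malloc returns memory suitably aligned for any built-in type, so the
    * buffer itself starts on (at least) an 8-byte boundary... */
   printf("arr     %% 8 = %d\n", (int)((uintptr_t)arr % 8));

   /* ...but stepping the pointer on by one int (4 bytes) gives an address
    * that is only 4-byte aligned: fine for an int, not for a double on the
    * Xeon Phi. */
   printf("arr + 1 %% 8 = %d\n", (int)((uintptr_t)(arr + 1) % 8));

   free(arr);
   return 0;
}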

This is not generally an issue for Fortran codes, or for most C codes. However, the library we are trying to port performs a number of pointer arithmetic operations in C which fall foul of this requirement on the Xeon Phi (but work quite happily, though not necessarily efficiently, on standard Xeon processors). Here is an example program that exhibits the problem:

#include <stdio.h>
#include <stdlib.h>
#define N 100

int routine(int *arr){
   double *local_arr;

   /* Step past the leading int (4 bytes) and reinterpret the rest of the
    * buffer as doubles; the resulting address is only 4-byte aligned. */
   local_arr=(double *)(arr+1);

   /* Misaligned 8-byte store: fine on a standard Xeon, crashes on Xeon Phi. */
   local_arr[0]=1.;
   printf("Value: %lf \n",local_arr[0]);
   return 0;
}

int main(){
  int *arr;

  /* malloc returns a well-aligned buffer; the misalignment is introduced by
   * the pointer arithmetic in routine(), not by the allocation. */
  arr=malloc(N*sizeof(int));
  routine(arr);
  free(arr);

  return 0;
}

The above program will crash on the Xeon Phi: the buffer is accessed as doubles (through local_arr), but the pointer arithmetic moves the address on by the size of an int, so the doubles are not 8-byte aligned and the access violates the Xeon Phi's alignment restriction. Granted, I don't think it's a nice bit of code anyway, but it was certainly a puzzle to us initially as to why this was crashing, and it took an email exchange with an engineer at Intel to clear it up. I guess the student may have to re-write parts of the library if we want to port it to the Xeon Phi.
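We haven't settled on how the library should be restructured, but two standard ways of handling this kind of layout are sketched below; the routine names are invented and this is not necessarily what the rewrite will look like. The first option pads the integer header to eight bytes so the doubles start on an aligned address; the second keeps the original layout but moves the misaligned bytes with memcpy, which has no alignment requirement on either argument.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define N 100

/* Option 1: pad the integer header to a multiple of 8 bytes, so the double
 * region starts on an 8-byte boundary (here we simply skip two ints). */
int routine_padded(int *arr){
   double *local_arr=(double *)(arr+2);
   local_arr[0]=1.;
   printf("Padded value: %lf \n",local_arr[0]);
   return 0;
}

/* Option 2: keep the original (misaligned) layout but move the bytes with
 * memcpy instead of dereferencing a misaligned double pointer. */
int routine_memcpy(int *arr){
   double value=1.;
   memcpy(arr+1,&value,sizeof value);   /* write the 8 bytes */
   memcpy(&value,arr+1,sizeof value);   /* read them back */
   printf("Copied value: %lf \n",value);
   return 0;
}

int main(){
  int *arr;

  arr=malloc(N*sizeof(int));
  routine_padded(arr);
  routine_memcpy(arr);
  free(arr);

  return 0;
}

The padded version costs four bytes per buffer; the memcpy version keeps the existing data layout, which matters if the library's interfaces or file formats depend on it.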