Intel Parallel Computing Centre: progress report
Posted: 21 Nov 2014 | 10:29
EPCC's Grand Challenges Optimisation Centre, an Intel Parallel Computing Centre which we announced earlier in the year, has made significant progress over recent months.
The collaboration was created to optimise codes for Intel processors, particularly to port and optimise scientific simulation codes for Intel Xeon Phi co-processors. EPCC also runs the ARCHER supercomputer for EPSRC and other UK research funding councils. ARCHER contains a large number of Intel Xeon processors (although no accelerators or co-processors), so we also have a strong focus on ensuring that scientific simulation codes are highly optimised for these processors. The IPCC work at EPCC has therefore been concentrating on improving the performance, on both Intel Xeon and Intel Xeon Phi processors, of a range of codes that are heavily used for computational simulation in the UK.
Intel Xeon Phi co-processor
The Intel Xeon Phi co-processor is an interesting piece of hardware, with characteristics of both standard processors and accelerator hardware (such as GPUs). Unlike GPU accelerators, the Xeon Phi uses fully functional processor cores (based on the Intel Pentium P54C) with cache coherency. Whilst simple and low-powered compared to modern processor cores, these cores are able to run full programs, and even operating systems. The cores have also been augmented with a large vector unit (512-bit), meaning that at each clock cycle a core can operate on 8 double-precision numbers at once. Furthermore, the cores have hyperthreading, or simultaneous multi-threading, functionality, enabling each core to efficiently host 4 hardware threads. The Xeon Phi generally has 60 of these cores on a single co-processor, along with 6–16 GB of fast GDDR5 memory (depending on the model of Xeon Phi being used). However, the cores do run slower than modern processors, generally being clocked at around 1 GHz.
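To make those figures concrete, the short sketch below works through the arithmetic using only the numbers quoted above. The final figure is an idealised upper bound that assumes every core keeps its vector unit full on every cycle, which real codes rarely manage.

```python
# Figures quoted in the text for a typical Xeon Phi co-processor.
VECTOR_WIDTH_BITS = 512
DOUBLE_BITS = 64
CORES = 60
THREADS_PER_CORE = 4

# Double-precision values that fit in one vector register.
lanes = VECTOR_WIDTH_BITS // DOUBLE_BITS

# Hardware threads available across the whole card.
hardware_threads = CORES * THREADS_PER_CORE

# Upper bound on double-precision elements processed per clock cycle
# if every core kept its vector unit full (an idealised assumption).
elements_per_cycle = CORES * lanes

print(lanes, hardware_threads, elements_per_cycle)
```

So a code has to expose both hundreds of threads' worth of parallelism and vectorisable inner loops before the co-processor can approach that bound.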
Using a Xeon Phi co-processor
There are two main ways of using a Xeon Phi co-processor. The first, called native mode, involves compiling an application to run directly on the co-processor and then running MPI, OpenMP, or hybrid (MPI+OpenMP) parallelisations across it to use the 60 cores and the multiple threads available per core. However, using the co-processor efficiently in this way requires a highly parallel code: any serial parts of the program (running on a single core or thread) will run slowly compared to the same program on a modern processor, because of the simple Pentium core and low clock speed.
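The cost of serial code in native mode can be illustrated with an Amdahl's law style estimate. This is a minimal sketch rather than a measurement: the function name and the slowdown factor of 4 for a single co-processor core relative to a host core are hypothetical values chosen purely for illustration.

```python
def native_speedup(serial_fraction, workers, serial_slowdown):
    """Estimated speedup relative to one fast host core.

    serial_fraction: share of host run time that cannot be parallelised.
    workers: number of co-processor cores used in parallel.
    serial_slowdown: how many times slower one co-processor core is
        than the host core (hypothetical value, not a measurement).
    """
    serial_time = serial_fraction * serial_slowdown
    parallel_time = (1.0 - serial_fraction) * serial_slowdown / workers
    return 1.0 / (serial_time + parallel_time)

# 60 cores, each assumed to be 4x slower than a host core.
for fraction in (0.0, 0.01, 0.05, 0.10):
    speedup = native_speedup(fraction, 60, 4.0)
    print(f"serial fraction {fraction:.2f}: speedup {speedup:.2f}")
```

Under these assumed values a perfectly parallel code would run 15 times faster than on one host core, but a serial fraction of only 5% cuts that to under 4 times, which is why the quality of the parallelisation matters so much in native mode.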
The second way, called offload mode, is to run the Xeon Phi as a true co-processor: the main program runs on the main processor in the system, and computationally intensive kernels or subroutines run on the Xeon Phi. Data is transferred to and from the Xeon Phi as required, and the offloaded work is parallelised across the cores and threads available on the Xeon Phi using OpenMP or alternative threading solutions.
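Because offload mode has to move data to and from the co-processor, a kernel is only worth offloading if the time saved outweighs the transfer cost. The sketch below is a deliberately simple cost model: the function names, timings, data volume and link bandwidth are all hypothetical, and it assumes transfers are not overlapped with computation.

```python
def offload_time(kernel_time_phi, bytes_moved, bandwidth_bytes_per_s):
    """Time for one offloaded kernel: transfer data, then compute."""
    transfer_time = bytes_moved / bandwidth_bytes_per_s
    return transfer_time + kernel_time_phi

def worth_offloading(kernel_time_host, kernel_time_phi,
                     bytes_moved, bandwidth_bytes_per_s):
    """True when offloading beats running the kernel on the host."""
    return offload_time(kernel_time_phi, bytes_moved,
                        bandwidth_bytes_per_s) < kernel_time_host

# Hypothetical numbers: 1 GB moved in total over a 5 GB/s link.
print(worth_offloading(1.0, 0.5, 1e9, 5e9))   # transfer 0.2 s, total 0.7 s
print(worth_offloading(1.0, 0.9, 1e9, 5e9))   # transfer 0.2 s, total 1.1 s
```

In this model the first kernel is worth offloading and the second is not, even though it still computes faster on the co-processor than on the host.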
Optimising scientific simulation programs
For our IPCC collaboration, which is initially scheduled for two years, we are looking at optimising four scientific simulation programs: CP2K, GS2, COSA, and CASTEP. In the first 6 months of the work we have been actively working on CP2K, GS2, and COSA.
CP2K: a materials science simulation code
CP2K is a materials science simulation code that EPCC has a long history of working with, and indeed we previously carried out a port of CP2K to Xeon Phi co-processors for the PRACE project, as documented here. That work found that whilst it is possible to compile and run CP2K as a native application on the Xeon Phi, the performance did not match that of running it on standard Intel processors. However, we have a number of possible avenues for improving that performance, particularly around the vectorisation of the code by the compiler. As the Xeon Phi has very wide vector units but a low clock speed, to get good performance from the co-processor it is necessary to ensure that any code running on the system is using the vector units efficiently.
Vectorisation is the process of mapping a serial operation (e.g. a = b + c) onto hardware which can perform several of these operations as a single instruction. For most programs it won't be useful to run a = b + c multiple times. However, scientific applications generally operate on large arrays or matrices, so a more common operation would be to sum the elements of two vectors and assign the results to a third, i.e. a[i] = b[i] + c[i]. If this operation is done in a loop from the first element of the vectors to the last, then we have multiple operations that can be done independently on different but consecutively located data. In a scalar program these would be done consecutively, one per clock cycle. Vector hardware, however, aims to do several of these operations in a single clock cycle; for the Xeon Phi core, which has 512-bit vector units, this could be eight of these operations (on 64-bit numbers) each clock cycle.
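The difference between the scalar and vector versions of the a[i] = b[i] + c[i] loop can be modelled in a few lines. This is a conceptual sketch only: Python does not issue vector instructions here, and the step counters simply stand in for the number of add instructions each approach would need, assuming 8 lanes (512 bits / 64 bits).

```python
def scalar_add(b, c):
    """Add two equal-length lists one element at a time."""
    a = []
    steps = 0
    for i in range(len(b)):
        a.append(b[i] + c[i])
        steps += 1          # one add "instruction" per element
    return a, steps

def vector_add(b, c, lanes=8):
    """Add two lists in groups of `lanes` elements, modelling one
    vector instruction per group (8 lanes = 512 bits / 64 bits)."""
    a = []
    steps = 0
    for start in range(0, len(b), lanes):
        chunk_b = b[start:start + lanes]
        chunk_c = c[start:start + lanes]
        a.extend(x + y for x, y in zip(chunk_b, chunk_c))
        steps += 1          # one vector "instruction" per group
    return a, steps

b = [float(i) for i in range(20)]
c = [2.0 * i for i in range(20)]
scalar_result, scalar_steps = scalar_add(b, c)
vector_result, vector_steps = vector_add(b, c)
print(scalar_result == vector_result, scalar_steps, vector_steps)
```

For 20 elements the model needs 20 scalar steps but only 3 vector steps, the last of which is only partly filled. Both versions produce the same result.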
However, vectorisation generally relies on compilers recognising that particular parts of code can be mapped to the vector hardware and run at the same time, and this is not always straightforward. As a result, there can be codes that are not achieving the performance they should, on either Xeon Phi or standard Intel or AMD processors, because they are not using the vectorisation functionality available in the hardware. A significant part of the work we are undertaking in this collaboration with Intel is to examine how well the compiler is vectorising our chosen codes, and to improve that vectorisation by restructuring the codes where appropriate and possible.
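One reason compilers have to be cautious is that not every loop has independent iterations. The sketch below is an illustrative model rather than anything taken from our codes: it uses a running sum, a[i] = a[i-1] + b[i], where each iteration needs the previous result, and shows that naively processing it in groups of 8 gives the wrong answer. A compiler that cannot prove iterations are independent will fall back to scalar code rather than risk this.

```python
def serial_running_sum(b):
    """a[i] = a[i-1] + b[i]: each iteration needs the previous result."""
    a = [b[0]]
    for i in range(1, len(b)):
        a.append(a[i - 1] + b[i])
    return a

def naive_vector_running_sum(b, lanes=8):
    """Wrongly treat the same loop as independent: within each group,
    all lanes read a[i-1] before any lane in that group has written."""
    a = [b[0]] + [0.0] * (len(b) - 1)
    for start in range(1, len(b), lanes):
        stop = min(start + lanes, len(b))
        old = [a[i - 1] for i in range(start, stop)]   # read phase
        for offset, i in enumerate(range(start, stop)):
            a[i] = old[offset] + b[i]                  # write phase
    return a

b = [1.0] * 10
print(serial_running_sum(b))
print(naive_vector_running_sum(b))
```

The two lists differ because most lanes in the naive version read values that have not been updated yet. Restructuring a code so that its hot loops avoid this kind of dependency is one way of helping the compiler to vectorise.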
We are undertaking such work for CP2K. However, a prerequisite for any optimisation of a large and complex simulation package such as this is that we can effectively test and verify the code. CP2K has a test suite with regression tests designed to undertake this task, but they did not work with the Intel compilers (which we need in order to use the Xeon Phi). Therefore, we have been working to integrate the test suite with Intel compilers and libraries so that we can run the test suite on the Xeon Phi. Now that this work has been completed, we are evaluating the vectorisation performance of CP2K.
GS2: an initial value flux-tube gyrokinetic plasma simulation code
The second code we planned to investigate was GS2. This is an initial value flux-tube gyrokinetic plasma simulation code, written in FORTRAN and parallelised using MPI. GS2 has recently been optimised for large scale simulations, as documented in this blog article. However, the nature of the simulations undertaken in GS2 means that it is heavily dominated by MPI communications when using large numbers of processes. Therefore, if we want to port this to the Xeon Phi using native mode, we need to address the problem of the MPI cost increasing as we go to large process counts. The natural solution to this challenge is a hybrid parallelisation, adding OpenMP to the existing MPI parallelisation. This enables GS2 to use a higher number of cores or threads without requiring larger numbers of MPI processes (which would require splitting the simulation domain across more processes and thus entail more MPI communications).
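The effect of the hybrid approach on process counts is simple arithmetic, sketched below. The function name and the core count of 3072 are illustrative assumptions, not values taken from our GS2 runs.

```python
def mpi_processes(cores, threads_per_process):
    """MPI processes needed to occupy `cores` with a hybrid code."""
    if cores % threads_per_process != 0:
        raise ValueError("threads per process must divide the core count")
    return cores // threads_per_process

cores = 3072   # illustrative core count, not taken from the results
for threads in (1, 2, 3, 4, 6):
    print(threads, mpi_processes(cores, threads))
```

With one thread per process this is the pure MPI case, needing 3072 processes. Using 3 OpenMP threads per process fills the same cores with 1024 processes, so the simulation domain is split into a third as many pieces.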
Scaling GS2 on ARCHER
We have implemented a hybrid version of GS2, and explored its performance versus the original MPI code. Figure 1 shows the scaling we get on ARCHER for a representative simulation case. At lower core counts (256 and 512 cores) the original MPI parallelisation outperforms the hybrid code. However, when scaling beyond those core counts the hybrid version performs better, with simulations up to 25% quicker than the pure MPI version. It is also clear from the graph that, whilst the hybrid code is faster at higher core counts, the parallelisation is only more efficient when using small numbers of OpenMP threads, with 2 or 3 OpenMP threads per MPI process giving optimal performance for this case.
Figure 1: Scaling of the GS2 code on ARCHER comparing the pure MPI code and the hybrid code using various numbers of OpenMP threads per MPI process.
We have also run the hybrid code on a Xeon Phi co-processor (using a different, smaller test case to fit into the memory available on a single Xeon Phi). The host machine has 2 eight-core Intel Xeon processors and 2 Xeon Phi co-processor cards. When we ran the pure MPI code on the Xeon Phi it was, at best, 1.5x slower than the same code running on one host processor using 8 MPI processes, and 2.3x slower than the same code using both host processors (16 MPI processes). However, this compares 236 MPI processes on the Xeon Phi with 16 MPI processes on the Xeon processors, which is an unfair comparison: the larger number of MPI processes on the Xeon Phi creates higher communication overheads in GS2.
To reduce the number of MPI processes required on the Xeon Phi we ran the hybrid version of GS2, using 3 OpenMP threads per MPI process. This enabled us to use a third of the MPI processes, and gave performance on a par with one host processor, and 1.5x slower than the two-processor result. Whilst not ideal, this is still good progress, and gives us hope of optimising GS2 on the Xeon Phi so that it runs faster than two standard processors.
COSA: a frequency domain CFD code
Finally, we have been doing similar work on the COSA code, a frequency domain CFD code that employs harmonic balance algorithms to provide fast simulations of turbomachinery and similar scientific scenarios. COSA is another FORTRAN code that has previously been parallelised with OpenMP, MPI, and OpenMP+MPI, so it was straightforward to port this code to the Xeon Phi and run some benchmarks. The code as it currently stands runs 20% faster on a Xeon Phi than with 8 MPI processes on one host processor, but 40% slower than with 16 MPI processes across the two host processors. However, on examining the code and how well the compiler vectorises it, we have identified a number of significant computational kernels that are not vectorising properly, so there looks to be good scope to increase the performance of COSA on the Xeon Phi.
Adrian Jackson, EPCC