The Intel Parallel Computing Centre at EPCC

Author: Adrian Jackson
Posted: 15 Jun 2017 | 13:41

We are entering the fourth year of the Intel Parallel Computing Centre (IPCC). This collaboration on code porting and optimisation has focussed on improving the performance of scientific applications on Intel hardware, specifically its Xeon and Xeon Phi processors.  

These processor varieties represent the two ends of the processor spectrum we see in modern computing platforms. Xeon is the multi-core processor line used in desktop and server class systems; Xeon Phi is the many-core processor line for computational simulation applications.

The primary differences between the two types are the number of compute cores on the chip (60-70 for Xeon Phi, 10-20 for Xeon) and the support for floating point calculations. The Xeon Phi processor supports very wide vector instructions, which perform many floating point calculations with a single instruction, with up to two 512-bit wide vector units per core (each operating on 8 double precision numbers at a time); the Xeon processor supports single 256-bit or 512-bit vector units.
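
As a quick check of the arithmetic, here is a minimal sketch in C, assuming 64-bit double precision values. The function name vector_lanes is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements of a given width that fit in one vector register.
 * vector_lanes is a hypothetical helper, used only for illustration. */
size_t vector_lanes(size_t vector_bits, size_t element_bits)
{
    assert(element_bits > 0 && vector_bits % element_bits == 0);
    return vector_bits / element_bits;
}
```

Calling vector_lanes(512, 64) gives 8 and vector_lanes(256, 64) gives 4, so a loop over double precision data can, at best, process 8 or 4 elements per vector instruction respectively.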

If we optimise an application’s use of these large vector units on a Xeon Phi processor, we should also see benefits on Xeon processors. It is possible to vectorise applications by hand (e.g. using intrinsics), but our focus has been on restructuring codes so that the compiler can vectorise them effectively.
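
To give a flavour of what "compiler-friendly" code can look like, here is a small sketch in C. It is not taken from any of the applications we worked on; the function name scaled_add is hypothetical. The loop uses unit-stride access, has no branches in its body, and uses the C99 restrict qualifier to promise the compiler that the two arrays do not overlap, which are all things that generally make automatic vectorisation easier.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] += a * x[i] for i in [0, n).
 * restrict tells the compiler that x and y never overlap, removing one
 * common reason for it to refuse to vectorise the loop. */
void scaled_add(size_t n, double a,
                const double *restrict x, double *restrict y)
{
    assert(n == 0 || x != y);  /* cheap debug check of the no-overlap promise */
    for (size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

Whether a given loop is actually vectorised depends on the compiler and its flags, so it is worth checking the compiler's optimisation report rather than assuming.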

Improving vectorisation 

One example of such work was a collaboration with Dr Angus Creech of the Institute for Energy Systems in Edinburgh, where we looked at improving the vectorisation of the CFD modelling package Fluidity.

Many of the key computational kernels in Fluidity are generalised to allow different types of simulation. However, this generality impacts performance because it inhibits the compiler’s ability to optimise these routines.

To optimise the routines for tidal simulations, we created more specialised versions that converted dynamic array allocations to static allocations, added compile-time loop length definitions to enable the compiler to vectorise the loops, and inlined the main computational routines.
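
The three changes can be illustrated with a toy kernel. This is a sketch in C under assumed names (NLOC, add_scaled_generic, add_scaled_fixed and scale are all hypothetical), not code from Fluidity. The generic version takes its loop length at run time and allocates scratch space dynamically; the specialised version fixes the loop length at compile time, uses a fixed-size local array, and calls a small inline helper.

```c
#include <assert.h>
#include <stdlib.h>

#define NLOC 4  /* hypothetical loop length, fixed at compile time */

/* Generic kernel: run-time loop length, dynamically allocated scratch. */
void add_scaled_generic(int nloc, double a, const double *x, double *y)
{
    assert(nloc > 0);
    double *tmp = malloc((size_t)nloc * sizeof *tmp);
    if (tmp == NULL) abort();
    for (int i = 0; i < nloc; ++i) tmp[i] = a * x[i];
    for (int i = 0; i < nloc; ++i) y[i] += tmp[i];
    free(tmp);
}

/* Small helper the compiler can inline into the loop body. */
static inline double scale(double a, double x) { return a * x; }

/* Specialised kernel: compile-time loop length, fixed-size scratch array. */
void add_scaled_fixed(double a, const double x[NLOC], double y[NLOC])
{
    double tmp[NLOC];
    for (int i = 0; i < NLOC; ++i) tmp[i] = scale(a, x[i]);
    for (int i = 0; i < NLOC; ++i) y[i] += tmp[i];
}
```

Both versions compute the same result; the specialised one simply hands the compiler more information (a known trip count and no heap allocation inside the kernel), at the cost of only working for one problem shape.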

This was combined with code that selects the optimised routines at run time if the correct type of simulation is being run, and falls back to the original code for functionality that hasn’t been optimised. This gave an approximately two-times speedup of the whole code, so simulations complete in about half the time.
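
One simple way to express this kind of run-time selection is with a function pointer, as in the sketch below. Again this is an illustration in C with hypothetical names (sim_type, TIDAL_N, select_kernel and the two kernels), not the mechanism Fluidity actually uses.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SIM_GENERIC, SIM_TIDAL } sim_type;  /* hypothetical */
typedef void (*kernel_fn)(size_t n, double a, double *v);

#define TIDAL_N 8  /* hypothetical size the specialised kernel supports */

/* Original, fully general kernel: works for any n. */
static void scale_generic(size_t n, double a, double *v)
{
    for (size_t i = 0; i < n; ++i) v[i] *= a;
}

/* Specialised kernel: only valid when n == TIDAL_N. */
static void scale_tidal(size_t n, double a, double *v)
{
    assert(n == TIDAL_N);
    (void)n;  /* unused when assertions are compiled out */
    for (size_t i = 0; i < TIDAL_N; ++i) v[i] *= a;
}

/* Pick the optimised kernel only when the simulation matches its
 * assumptions; otherwise fall back to the original code. */
kernel_fn select_kernel(sim_type type, size_t n)
{
    if (type == SIM_TIDAL && n == TIDAL_N) return scale_tidal;
    return scale_generic;
}
```

The selection is made once, outside the hot loop, so the check itself costs almost nothing, and any simulation that does not meet the specialised kernel's assumptions still runs correctly through the general path.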

Future plans

In this fourth year, we will be looking at similar applications and optimisation work, including optimising the statistical programming language, R, for the Knights Landing processor; investigating performance portability for applications across Xeon, Xeon Phi, and GPU processors; and developing models to understand the best ways for applications to use the Xeon Phi’s high bandwidth memory.

If you’re interested in the Xeon Phi processor, in how your application may perform on the hardware, or any other aspect of this work, please don’t hesitate to contact me.

Intel Parallel Computing Centres:
https://software.intel.com/en-us/ipcc

Fluidity project:
http://fluidityproject.github.io

The image above was produced by the CFD modelling package Fluidity.

Author

Adrian Jackson, Research Architect, EPCC
Twitter: @adrianjhpc
