Early experiences with KNL

Author: Adrian Jackson
Posted: 29 Jul 2016 | 16:45

Initial experiences on early KNL

Updated 1st August 2016 to add a sentence describing the MPI configurations of the benchmarks run.
Updated 30th August 2016 to add CASTEP performance numbers on Broadwell with some discussion

EPCC was lucky enough to be allowed access to Intel's early KNL (Knights Landing, Intel's new Xeon Phi processor) cluster, through our IPCC project.

KNL is a many-core processor, successor to the KNC, that has up to 72 cores, each of which can run 4 threads, and 16 GB of high bandwidth memory stacked directly on to the chip.

KNL processors can also be purchased with a network interconnect (Intel's Omnipath network) on-chip, and they have access to standard main memory, like a normal processor does.

One of the major differences between KNL and its predecessor, KNC, is that this new processor can be self-hosting, i.e. it does not need to sit alongside a standard processor as current GPUs or KNCs do.

Porting computational simulation applications

We have been working for a number of years on porting computational simulation applications to the KNC, with varying success. We were keen to test this new processor with its promise of 3x serial performance compared to the KNC and 5x memory bandwidth over normal processors (using the high-bandwidth MCDRAM memory attached to the chip).

The processors we had access to were pre-release versions, so slightly lower spec than those that you're likely to encounter in upcoming systems. Specifically we were using the 7210 version running at 1.30 GHz, with 64 cores, 16 GB MCDRAM, and access to 96 GB DDR4 running at 2133 MT/s.

Application performance

For the initial performance we chose to run three applications that we have worked on before, or know are important applications in terms of UK HPC usage:

  • COSA: CFD simulation code; Fortran parallelised with MPI
  • GS2: Gyrokinetic simulation code; Fortran parallelised with MPI
  • CASTEP: Density functional theory materials simulation code; Fortran parallelised with MPI

The attentive amongst you will have noticed a pattern in the codes we've chosen: they're all Fortran codes parallelised with MPI. However, this isn't surprising if you know that 70%+ of the usage of our large-scale HPC systems is exactly this. I should also note that all the above codes have some hybrid versions/functionality (generally MPI+OpenMP). However, we're not using the hybrid versions in these benchmarks, as they tend not to be as efficient on small-scale parallelisations (the hybrid functionality tends to help parallel scaling rather than performance).

HPC hardware

We assessed the KNL performance against a number of other systems: 

  • A KNC system we have at EPCC (5110P Xeon Phi 1.053 GHz 60-core accelerators)
  • ARCHER, the UK's IvyBridge based large-scale HPC system (two 2.7 GHz, 12-core E5-2697 v2 processors)
  • A new EPCC HPC system that contains Broadwell-based nodes (two 2.1 GHz, 18-core E5-2695 v4 processors).

Our tests only use a single node of each system (so up to 24 cores on ARCHER, 36 cores on the Broadwell-based system, and a single KNC or KNL in the accelerator systems).  We're not testing scaling beyond single nodes at the moment, although this will be an interesting aspect to look at given the Omnipath network connections that can be used on the KNL.

We've also not done any performance optimisations for the KNL in these applications; we've simply re-compiled the code for the KNL and run it.

Benchmark results

The results from our initial benchmark runs are shown in the following table. The results are run times (in seconds) of the applications across the range of platforms using the same test case on each platform. The benchmarks were run 5 times and the fastest time has been chosen for the table. All of the applications are run using MPI parallelisations only, and the fastest MPI process count configuration found is used on each architecture. For COSA the results are using 61 MPI processes on KNL, 24 MPI processes on IvyBridge and 36 MPI processes on Broadwell; for GS2 the results are using 64 MPI processes on KNL, 24 on IvyBridge, 36 on Broadwell, and 235 on KNC. These were the configurations that ran fastest on each architecture.
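A minimal sketch of that selection procedure, for anyone wanting to reproduce the approach. The function names, the caller-supplied `timer`, and the `mpirun` command line in the usage note are all hypothetical; these are not the scripts used to produce the numbers in this post.

```python
import subprocess
import time


def time_command(argv):
    """Return the wall-clock run time of one command, in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


def best_of(run_once, repeats=5):
    """Call run_once() `repeats` times and keep the fastest time."""
    return min(run_once() for _ in range(repeats))


def fastest_configuration(process_counts, timer, repeats=5):
    """Return (process_count, best_time) for the fastest MPI process count.

    timer(n) must run the benchmark once with n MPI processes and
    return its run time in seconds.
    """
    best = {n: best_of(lambda: timer(n), repeats) for n in process_counts}
    fastest = min(best, key=best.get)
    return fastest, best[fastest]
```

With a real launcher this might be called as `fastest_configuration([32, 61, 64], lambda n: time_command(["mpirun", "-n", str(n), "./cosa"]))`, where the executable name and the list of process counts are placeholders.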

Application   KNC   KNL     KNL HB   IvyBridge   Broadwell
COSA          n/a   561     450      497         349
GS2           400   184.2   103.8    126.6       83.4
CASTEP        n/a   149     146      102         38

The first thing to point out is that we don't have results for all systems for all benchmarks. The benchmarks we chose for COSA and CASTEP did not fit in the memory available on the KNCs we have (they are limited to 8 GB of memory), so we couldn't get results for KNC for these tests. When this post was first published we had also not yet compiled CASTEP for our Broadwell-based system (it was under maintenance at the time); that result was added in the 30th August update.

The second thing that needs to be cleared up is the difference between the KNL and KNL HB columns. The KNL column has the results for running the code on KNL using only the main DDR memory in the node, not the MCDRAM on the KNL processor, while the KNL HB column has the run times for the benchmarks when the applications have been run using MCDRAM (either by modifying the application or using cache mode).

KNL memory

At this point it's worth a little aside about the KNL MCDRAM memory. The current KNL processors have 16 GB of this memory on the processor, as well as direct access to the main memory in the node. This means the KNL can access a very large memory space (up to 1 TB, depending on what standard DRAM is installed in the node), which is new for many-core processors (the KNC and current generation GPUs don't get such access to main memory, although next generation GPUs will be able to). However, the MCDRAM is also novel hardware for the type of processors we use for computational simulation.

The MCDRAM can be set up/used in different ways on the KNL:

  • Flat: the MCDRAM and main memory are two separate memory spaces
  • Cache: the MCDRAM is a cache for main memory
  • Hybrid: part of the MCDRAM is used as cache for main memory, part is used as a separate memory space.

These modes have to be chosen at boot time for the node; a reboot is required to switch between memory modes.

In flat mode both memory spaces are available and can be used by an application (the MCDRAM and the main DDR memory), but you need to modify your application, or run it with a helper application (more about this in future blog posts), to exploit the MCDRAM (DDR memory is available as normal to any application).
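One commonly used helper of this kind is `numactl`, since in flat mode the MCDRAM shows up as its own NUMA node with no cores attached. This post does not say which tool was used for the runs here, so treat the following as an assumption-laden sketch: the function name is hypothetical, and the MCDRAM node number (1 below) depends on how the node is configured and should be checked with `numactl --hardware`.

```python
import shlex


def with_mcdram(app_argv, mcdram_node=1, strict=False):
    """Prefix an application command with a numactl memory policy.

    mcdram_node is an ASSUMPTION: the NUMA node number of the MCDRAM
    varies with the node's configuration.
    strict=True  -> --membind: allocate only from that node.
    strict=False -> --preferred: try that node first, then fall back
                    to the other memory (DDR) when it is full.
    """
    policy = "--membind" if strict else "--preferred"
    return ["numactl", f"{policy}={mcdram_node}", *app_argv]


# Hypothetical launch line: 64 MPI processes, each preferring MCDRAM.
command = ["mpirun", "-n", "64", *with_mcdram(["./gs2", "input.in"])]
print(shlex.join(command))
```

The `--preferred` policy is the gentler default in this sketch because a test case larger than 16 GB can still run, spilling into DDR, whereas `--membind` restricts allocations to the MCDRAM node.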

In cache mode the application no longer sees the MCDRAM; it simply acts as a cache for main memory, providing higher bandwidth memory access to data in main memory. However, it should be noted that whilst MCDRAM is higher bandwidth than standard main memory, it is also slightly higher latency (due to the extra logic needed in addressing the different banks of memory in the MCDRAM). This means that for applications where performance is dominated by memory latency, MCDRAM will not help performance.

IvyBridge vs KNL

We can see from the benchmarks that performance is comparable between the KNL and our IvyBridge system. Using DDR memory only, the KNL is ~13%-46% slower (COSA at the low end, GS2 and CASTEP at the high end), but then the KNL processors we are using are lower spec than the ones that will be used in KNL-based HPC systems.
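The relative slowdown can be worked out directly from the run times in the results table. A small sketch of that arithmetic (the variable and function names are just illustrative):

```python
# Run times in seconds from the results table (KNL column = DDR memory only).
knl = {"COSA": 561.0, "GS2": 184.2, "CASTEP": 149.0}
ivybridge = {"COSA": 497.0, "GS2": 126.6, "CASTEP": 102.0}


def percent_slower(t, t_ref):
    """How much longer (in percent) run time t is than the reference t_ref."""
    return 100.0 * (t / t_ref - 1.0)


slowdown = {app: percent_slower(knl[app], ivybridge[app]) for app in knl}
for app, pct in slowdown.items():
    print(f"{app}: KNL is {pct:.0f}% slower than IvyBridge")
```

This works out at roughly 13% for COSA, 45% for GS2, and 46% for CASTEP.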

We also see that the KNL is much faster than the KNC. On KNC we often saw performance 2x or 3x slower than IvyBridge, even after application optimisation, whereas the KNL performance is looking reasonable without any optimisation.

High bandwidth memory

The results also show that the high bandwidth memory can provide significant performance benefits for applications, enabling both COSA and GS2 to run faster than the equivalent IvyBridge benchmarks. This is a great result: we are now getting better performance on a single KNL than on a two-processor node in our HPC system.

However, we can also see that there are applications where the MCDRAM doesn't seem to help. CASTEP shows very little performance benefit from using the MCDRAM. We still need to do proper profiling to work out what the performance limiter on CASTEP is, and this should help us understand why the MCDRAM doesn't help (I suspect that CASTEP is memory latency dominated).
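To put numbers on the MCDRAM benefit, the speedup of the KNL HB column over the KNL column can be computed from the results table. Again this is only a sketch of the arithmetic, with illustrative names:

```python
# KNL run times in seconds from the results table.
knl_ddr = {"COSA": 561.0, "GS2": 184.2, "CASTEP": 149.0}     # KNL column
knl_mcdram = {"COSA": 450.0, "GS2": 103.8, "CASTEP": 146.0}  # KNL HB column


def speedup(t_before, t_after):
    """Speedup factor when run time drops from t_before to t_after."""
    return t_before / t_after


mcdram_speedup = {app: speedup(knl_ddr[app], knl_mcdram[app]) for app in knl_ddr}
for app, factor in mcdram_speedup.items():
    print(f"{app}: {factor:.2f}x faster with MCDRAM")
```

That gives about 1.25x for COSA and 1.77x for GS2, but only about 1.02x for CASTEP.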

Broadwell

The next thing we can see is that, whilst the KNL is performing very well compared to IvyBridge, it's not performing as well as the Broadwell processors we have in our new EPCC system. Broadwell is a fairer comparison for KNL, as Broadwell is a similar generation processor to KNL. Broadwell is performing very well with a simple re-compile and run, and shows that there is scope for us to do some optimisation work on the applications to get better performance on KNL.

It is also interesting to see the performance difference between IvyBridge, KNL, and Broadwell for CASTEP. The high bandwidth memory is not providing a benefit for CASTEP on KNL; however, the improved core count, vectorisation, clock speed, and memory bandwidth seem to be significantly helping CASTEP. Given the Broadwell system has 50% more cores in the node than the IvyBridge system, simple core-count scaling would suggest CASTEP running in ~70 seconds on Broadwell (102 seconds × 24/36 ≈ 68 seconds). However, at 38 seconds it is roughly 1.8x faster again than the core count improvement alone would suggest.
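The core-count estimate above is simple enough to check by hand, but here it is as a short sketch. It assumes perfect scaling with core count, which is an idealisation, and the function name is illustrative:

```python
def core_scaled_time(run_time, cores_from, cores_to):
    """Run time expected on cores_to cores, assuming perfect scaling."""
    return run_time * cores_from / cores_to


ivybridge_time = 102.0  # CASTEP on the 24-core IvyBridge node, seconds
broadwell_time = 38.0   # CASTEP on the 36-core Broadwell node, seconds

expected = core_scaled_time(ivybridge_time, cores_from=24, cores_to=36)
extra_gain = expected / broadwell_time

print(f"Expected from core count alone: {expected:.0f} s")
print(f"Measured time is {extra_gain:.2f}x faster than that estimate")
```

The estimate comes out at 68 seconds, so the measured 38 seconds is about 1.79x faster than core count alone would explain.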

It looks like there is some space opening up between the multi-core and many-core architectures that Intel is producing, with Broadwell and KNL giving different performance improvements for different types of codes. Certainly there's some work to do to identify what code features work best on what architectures!

We've got some other things we'd like to benchmark and understand. The first on the list is MPI performance on the KNL, so stay tuned for another blog post shortly on MPI benchmarks!

This post has also been published on Medium.

Author

Adrian Jackson, EPCC