MPI performance on KNL
Posted: 30 Aug 2016 | 12:22
Following on from our recent post on early experiences with KNL performance, we have been looking at MPI performance on Intel's latest many-core processor.
The MPI performance on the first generation of Xeon Phi processor (KNC) was one of the reasons that some of the applications we ported to KNC had poor performance. Figures 1 and 2 show the latency and bandwidth of an MPI ping-pong benchmark running on a single KNC and on a 2x8-core IvyBridge node.
The latency and bandwidth of the standard (IvyBridge) node are significantly better than on the KNC when comparing 2 processes on each processor, and the differences are more pronounced when both the KNC and the IvyBridge node are fully populated with MPI processes.
It could be argued that comparing 240 MPI processes on the KNC to 16 MPI processes isn't fair; after all, the performance of MPI will degrade as the process count increases.
Furthermore, the conventional wisdom for many-core processors such as the KNC and KNL is that the best performance is achieved using hybrid applications (ie MPI+OpenMP) to reduce the number of MPI processes required to use all the cores on the processor.
However, we work with a range of large-scale computational simulation applications that either don't yet have hybrid parallel functionality, or only have limited hybrid parallelisations. Therefore, the performance of pure MPI applications on processors is important to us.
Even so, 240 MPI processes on the KNC could still be regarded as excessive: for the KNC processor we were using, only 120 MPI processes would be required to use the hardware effectively. However, even comparing just 2 processes for the ping pong benchmark, the performance is still much better on the standard host than on the many-core processor.
It should be noted that all the benchmarking we are reporting on in this blog post is performed using only a single node (i.e. a single KNL or KNC, or a single node with a pair of multi-core processors). For the KNL benchmarking we are using Intel MPI 5.1.3; for the KNC and host benchmarking we are using Intel MPI 5.0.3.
Strange MPI performance
The observant amongst you will spot that our host results have strange performance characteristics, namely that the 2-process benchmark performs worse than the 16-process benchmark. This is a ping pong benchmark where only 2 processes send and receive data regardless of the number of MPI processes being used; the rest of the processes simply wait in a barrier.
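To make the shape of the benchmark concrete, here is a minimal sketch of such a ping pong test (our assumption of the benchmark's structure, not the exact code we ran): ranks 0 and 1 bounce a message back and forth while every other rank waits in a barrier.

```c
/* Minimal MPI ping pong sketch: ranks 0 and 1 exchange a message repeatedly;
   all other ranks simply wait in the final barrier. Message size and
   iteration count are illustrative and would be swept in a real run. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    const int nbytes = 8;                 /* one point in the message-size sweep */
    char *buf = malloc(nbytes);

    if (rank < 2) {
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)   /* one-way latency: half the round-trip time */
            printf("latency: %.3f us\n", (t1 - t0) / (2.0 * iters) * 1e6);
    }

    MPI_Barrier(MPI_COMM_WORLD);  /* the remaining ranks wait here */
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Launched with the full process count (e.g. `mpirun -np 240 ./pingpong` on the KNC), this still measures only the pair of communicating ranks, which is what makes the 2-process and fully-populated runs comparable.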
On the host, when we run the 2-process benchmark we specify that each process runs on a separate processor (our host has 2 x 8-core IvyBridge processors). When we run with 16 processes we simply fill both processors with MPI processes, which means the two MPI processes chosen to do the communication end up on the same processor. They therefore get better MPI message performance, as they are not sending messages across the inter-processor link, only copying data to and from the memory closest to that processor.
In effect our benchmark is demonstrating the NUMA performance effects on a multi-processor ccNUMA system.
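With Intel MPI the placement difference can be reproduced explicitly via process pinning; the commands below are illustrative (core numbering is machine-specific, and `./pingpong` is a hypothetical binary name), with cores 0 and 8 standing in for the first core of each IvyBridge socket.

```shell
# 2-process run, one process per socket: messages cross the inter-processor link.
mpirun -np 2 -genv I_MPI_PIN_PROCESSOR_LIST 0,8 ./pingpong

# 16-process run filling both sockets: ranks 0 and 1 land on the same socket,
# so messages stay within one processor's local memory.
mpirun -np 16 -genv I_MPI_PIN_PROCESSOR_LIST 0-15 ./pingpong
```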
Given that the second-generation Xeon Phi processor (KNL) has a completely re-designed network, the hope is that the MPI performance has improved as well. KNL has a 2D cache-coherent mesh interconnect that is used for memory accesses by the individual cores (each core on the KNL is part of a 2-core tile), and should provide faster access to both main memory and the MCDRAM than the ring bus the KNC had.
However, the question is how the MPI library has been implemented for the KNL. It is unlikely to be able to exploit direct cache accesses and will need to go to main memory or the MCDRAM for on-processor MPI communications, but this should mean the latency is similar to a standard node, where the same thing happens. We re-ran the same ping pong benchmark as on the KNC; the results are shown in Figures 3 and 4.
We can see from these results that the latency and bandwidth for KNL are significantly better than KNC and much closer to the IvyBridge host. In particular, the latency is much closer to what is seen on the standard node. The bandwidth isn't as high, but then we are only using 2 cores on the KNL, and it is likely that you need to use more KNL cores to get close to the peak bandwidth of the chip, whereas on the IvyBridge processor it is possible to get the full memory bandwidth from a single core.
To get a better picture of performance when actively using all cores on the nodes, we have also run an AlltoAll benchmark at various MPI process counts, with the latency of the AlltoAll operation shown in Figure 5.
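Unlike the ping pong test, an AlltoAll benchmark involves every rank communicating with every other rank, so it exercises the whole on-chip network. A sketch of how such a measurement is typically structured (again our assumption, not the exact code used) is:

```c
/* Sketch of an MPI_Alltoall latency measurement: every rank sends `count`
   doubles to every other rank; the mean time per operation is reported.
   `count` and `iters` are illustrative and would be swept in a real run. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1024;   /* elements sent to each rank (one sweep point) */
    const int iters = 100;
    double *sendbuf = malloc(sizeof(double) * count * size);
    double *recvbuf = malloc(sizeof(double) * count * size);
    for (int i = 0; i < count * size; i++) sendbuf[i] = rank;

    MPI_Barrier(MPI_COMM_WORLD);  /* synchronise before timing */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Alltoall(sendbuf, count, MPI_DOUBLE,
                     recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("alltoall mean latency: %.3f us\n", (t1 - t0) / iters * 1e6);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Because the data volume grows with both the message size and the process count, this is the benchmark where network saturation effects show up first.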
As with the ping pong benchmarks, we can see that KNL outperforms KNC significantly. Indeed, for large message sizes the KNL outperforms the IvyBridge system for low to medium process counts.
Interestingly, the only place where the performance isn't as good is when using 64 MPI processes on the KNL with large message sizes. It appears that when the network becomes saturated (ie with large message sizes) the performance of the KNL degrades and becomes more comparable with the KNC. However, for smaller (and arguably more realistic) message sizes the KNL is still significantly faster than KNC.
When running all these benchmarks, the KNL we were using (a 64-core 7210 Xeon Phi) was configured in flat memory mode (ie we weren't using the MCDRAM) and the mesh was in quadrant mode (the KNL interconnect between cores can be configured in different ways; more on this in future blogs).
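In flat mode the MCDRAM is exposed to the operating system as a separate, CPU-less NUMA node, so its use can be checked and controlled with `numactl`. The commands below are illustrative (`./alltoall` is a hypothetical binary, and the MCDRAM node number can vary with configuration, though it is typically node 1 on a flat/quadrant 7210):

```shell
# List the NUMA topology; in flat mode the MCDRAM appears as a node
# with memory but no CPUs (typically node 1).
numactl --hardware

# Run the same benchmark entirely out of DDR (node 0) or MCDRAM (node 1)
# to compare the effect on communication buffers:
mpirun -np 64 numactl --membind=0 ./alltoall
mpirun -np 64 numactl --membind=1 ./alltoall
```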
Our next steps are to look at MPI performance when using more than one KNL, investigating the performance of the Omni-Path interface on the KNL, and also seeing whether using the MCDRAM for MPI communications (ie configuring the KNL in MCDRAM caching mode) affects the performance of communication (it should improve bandwidth but may impact latency).
We will follow this up shortly in a new blog post.