Apples vs oranges: performance comparisons

 

Shall I compare thee...

Performance comparisons are always tricky to get exactly right. They are needed to demonstrate the improvements that optimisations, new hardware, new algorithms, and so on have made to an application or benchmark, but there is a lot of latitude in what can be compared, which makes it easy to get a performance comparison wrong and fail to demonstrate whatever it is you're trying to show.

If we go back far enough, it was often possible to do a direct comparison between serial and parallel versions of an application and get a good idea of how much faster the parallel version really was. However, modern computing hardware has made it increasingly difficult to do this fairly. Processors often change their clock speed when under-utilised (e.g. overclocking, turbo boost), as when a modern multi-core processor is running only a serial application.

Furthermore, using only a single core of a multi-core processor in a multi-processor node also means that core probably has access to more memory bandwidth than it would if all the cores were in use. Both of these factors mean that a parallel version of a code can appear to perform poorly compared to the serial version simply because the serial version gets proportionally more resources when benchmarked.

Because of these issues we often only look at the performance of full nodes when comparing different versions of a parallel program. On ARCHER, comparing 24 cores to 24 cores (a single node) before and after an optimisation or algorithmic change gives a more sensible performance comparison. Or, at the very least, it is sensible to compare programs using the same number of cores per node.

Large-scale benchmarking

The scale of data set/problem size needed for a parallel benchmark can also make it difficult to benchmark from a single core or a small number of MPI processes. If we're doing strong scaling benchmarking (keeping the data set/problem size fixed whilst we use more parallel processes or threads), then we need a problem size that is large enough to still give a reasonable amount of work to each worker when using thousands of threads or processes.

This usually means that the data set is too large to fit into the memory of a single node, or even a small number of nodes, so benchmarking starts at multiple nodes and scales out from there. The alternative would be to do weak scaling (keeping the data set/problem size per process or thread fixed, and therefore increasing the overall problem size when using more processes or threads). However, it is notoriously difficult to get the data set/problem size per thread/process correct and to ensure it doesn't vary much when scaling to larger core counts.

Parallel applications often aren't designed to decompose data or problems in such a fixed way; they generally try to balance a decomposition as much as possible. This means that constructing the problems/data sets for the range of core counts you are benchmarking on is a lot more work, and a lot more difficult, than simply using one benchmark case (i.e. strong scaling).
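To make the distinction concrete, here is a minimal sketch (in Python, with made-up problem sizes rather than figures from any real benchmark) of how the work per process differs between strong and weak scaling:

```python
# A minimal sketch with illustrative numbers (not from the article): the
# difference between strong and weak scaling in terms of work per process.

def strong_scaling_work(total_problem_size, num_processes):
    """Strong scaling: the total problem size is fixed, so the work
    available to each process shrinks as processes are added."""
    return total_problem_size / num_processes

def weak_scaling_size(problem_size_per_process, num_processes):
    """Weak scaling: the work per process is fixed, so the total problem
    size grows as processes are added."""
    return problem_size_per_process * num_processes

# A fixed 10^9-cell problem spread over 10,000 processes leaves only
# 10^5 cells each -- the benchmark must be big enough to keep them all busy.
print(strong_scaling_work(1e9, 10_000))    # 100000.0

# Keeping 10^6 cells per process instead means constructing a different,
# 10^10-cell problem for a 10,000-process run (and another problem for
# every other core count benchmarked).
print(weak_scaling_size(1e6, 10_000))      # 10000000000.0
```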

However, because we start benchmarking not from a single process or thread, or even from a single node, but from many processes/threads/nodes, we can end up hiding poor performance behind good scaling. Figure 1 is a good example of this: the code appears to scale perfectly, being in line with the perfect scaling curve we have constructed. 

 

Figure 1: MPI scaling compared to ideal scaling

However, this perfect scaling curve is created simply by taking the runtime at the lowest core count and dividing it by the factor by which the resources (the number of cores, in this case) have increased as we've scaled the benchmark. If the time at the lowest core count is already inefficient, then the rest of the scaling will look good. I know that in this case the graph is flattering the performance, because more in-depth profiling shows that around 20% of the runtime is being spent in MPI for this application.

That 20% of the runtime spent in communications is an overhead for this application, and if we were able to compare against a single-process run we would see it reducing scaling performance as we increase the amount of resource we use. However, because we start on a reasonable number of nodes, that performance penalty has already been internalised and we don't see any performance issue.
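As a rough sketch of how that reference curve is built (the core counts, times, and the 20% overhead figure below are assumed for illustration, not taken from the real benchmark), consider:

```python
# A minimal sketch with assumed numbers: the "perfect scaling" curve is just
# the lowest-core-count runtime divided by the factor the resources grew,
# so any inefficiency in that baseline is carried along invisibly.

baseline_cores = 1536       # smallest job in the scaling study (assumed)
baseline_time = 1000.0      # runtime in seconds at that size (assumed)

core_counts = [1536, 3072, 6144, 12288]

# Ideal curve derived from the (possibly inefficient) baseline.
ideal = [baseline_time / (c / baseline_cores) for c in core_counts]

# If ~20% of the baseline runtime is MPI overhead, an overhead-free code
# would start from a faster baseline and a lower ideal curve.
compute_only = [0.8 * baseline_time / (c / baseline_cores) for c in core_counts]

for c, t_ideal, t_compute in zip(core_counts, ideal, compute_only):
    print(f"{c:6d} cores: ideal {t_ideal:7.1f} s, overhead-free {t_compute:7.1f} s")
```

Measured times that sit on the first curve look perfect, even though a code without the communication overhead would have been tracking the second, lower curve all along.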

Now, Figure 1 still shows something interesting, and something to be happy with. The communication overhead, the cost of the parallelisation, doesn't seem to be getting worse as we scale up to much larger numbers of cores. That's a good result. But it's not the same as being able to say authoritatively that this application scales perfectly. All we can say is that it scales perfectly (or at least very well) if we start at this particular point, and with this particular benchmark.

Manycore vs multicore

So, performance comparisons are already complicated by current hardware and by scaling issues. However, new hardware has made this even more complicated. Intel's KNL processor is an interesting target for performance analysis and application benchmarking, and indeed we have done quite a lot of this during our IPCC work. Figure 2 shows a comparison of an application's performance on ARCHER and on ARCHER's KNL test system.

 

Figure 2: ARCHER vs ARCHER KNL: MPI to MPI process comparison

Reading Figure 2 naively, you would be forgiven for saying that the KNL doesn't perform as well as the ARCHER system. On a process-to-process comparison (the same number of MPI processes run on the two systems), the KNL system certainly seems slower: at every process count, the KNL system is slower than ARCHER.

However, if instead of comparing MPI process counts we compare the number of nodes used, as shown in Figure 3, then we see a different picture. 

Now the KNL system outperforms the standard ARCHER nodes. The KNL system has 64 cores per node, whereas ARCHER has 24 cores per node. Each MPI process is slower on the KNL system than on the standard ARCHER nodes, but the fact that there are nearly 3x more cores in a KNL node means we can complete simulations faster on the KNL system for the same number of nodes when compared to the multi-core system.
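A small worked example (with hypothetical timings, not the measured results behind Figures 2 and 3) shows how the same benchmark can support both readings:

```python
# Hypothetical timings, purely illustrative: a fixed benchmark run with the
# same number of MPI processes on both systems, then compared per node.

archer_cores_per_node = 24
knl_cores_per_node = 64

archer_time_256 = 100.0   # seconds for 256 MPI processes on ARCHER (assumed)
knl_time_256 = 160.0      # seconds for 256 MPI processes on KNL (assumed)

# Process-to-process: KNL looks 1.6x slower.
print(knl_time_256 / archer_time_256)

# Node-to-node: 256 processes fill exactly 4 KNL nodes, but 4 ARCHER nodes
# only hold 96 processes. Assuming roughly linear strong scaling, the ARCHER
# run on 4 nodes takes about 256/96 times its 256-process time.
nodes = 256 // knl_cores_per_node                    # 4
archer_procs = nodes * archer_cores_per_node         # 96
archer_time_4_nodes = archer_time_256 * 256 / archer_procs
print(archer_time_4_nodes, knl_time_256)             # ~266.7 s vs 160.0 s
```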

 

Figure 3: ARCHER vs ARCHER KNL: node to node comparison

This also seems to be a better way of comparing performance in this case, because on ARCHER we charge on a node-allocation basis, not by the number of MPI processes or threads used, and the energy used to complete a simulation is much more strongly related to the number of nodes used than to what we fill those nodes with. As the energy consumed by an application is a large part of the cost of running it, this node-to-node comparison seems to be a better way of comparing performance.
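As a back-of-the-envelope illustration (the per-node power draws below are invented placeholders, and the runtimes reuse the hypothetical figures from the sketch above), both the allocation charge and the energy bill track nodes multiplied by runtime, not process counts:

```python
# Invented power draws and assumed runtimes, for illustration only: both the
# allocation cost and the energy to solution follow nodes * runtime.

def node_hours(num_nodes, runtime_hours):
    """Allocation cost of a job, charged per node regardless of how many
    processes or threads run on each node."""
    return num_nodes * runtime_hours

def energy_kwh(num_nodes, node_power_watts, runtime_hours):
    """Energy to solution, assuming a flat per-node power draw."""
    return num_nodes * node_power_watts * runtime_hours / 1000.0

runtime_h = 267 / 3600   # assumed multi-core node runtime (hours)
print(node_hours(4, runtime_h), energy_kwh(4, 300, runtime_h))

runtime_h = 160 / 3600   # assumed manycore node runtime (hours)
print(node_hours(4, runtime_h), energy_kwh(4, 350, runtime_h))
```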

Node to node, hardware to hardware

Comparing node to node also allows us to compare different kinds of hardware: GPUs to manycore nodes, multi-core to GPUs, and so on. Comparing a node to a node seems fair in most situations. Having said that, there are hardware and setups that break this fairness. NVIDIA's DGX-1 server can be considered a single node, but it contains 8 GPUs and probably consumes a lot more energy than a single KNL node or a standard multi-core node.

Likewise, Google's TPU processor can run neural network workloads much more quickly than GPUs or multi-core processors, and for much less energy. However, it's not general purpose and can't run a whole range of different applications, so maybe it's not a fair comparison either.

Therefore, when we're doing these comparisons we probably want to evaluate the energy consumed by the nodes, to see whether they are similar enough to allow comparison, and also the range of applications they can run. It would also be good to consider the effort required to port to them, although that is hard to evaluate objectively.

Whatever the comparison, some careful thought is needed when doing anything other than comparing two different versions of the same application on the same hardware with the same test case. Even then, it's worth thinking about what performance you are really interested in (usually it's actually time to solution, rather than scaling or some other such metric) and how you can fairly demonstrate what has been achieved.