HPCG: benchmarking supercomputers
Posted: 30 Jul 2015 | 14:40
The LINPACK benchmark, in its High Performance Linpack (HPL) implementation, has been used to benchmark large-scale computers for over 20 years, with the results being published in the Top500 list. But does it accurately reflect the performance of real applications?
ARCHER is currently number 35 in the list (having debuted at number 19 when first installed), with a performance of 1.6 PFlop/s (Rmax: the maximum measured performance on the HPL benchmark) out of a potential peak performance of 2.55 PFlop/s (Rpeak: the maximum theoretical floating point performance of the hardware in the system).
However, there are drawbacks to using HPL as a benchmark for assessing the performance of large-scale computing systems. Primarily, HPL stresses floating point performance: it runs dense linear algebra algorithms that, for the most part, work on data stored in local caches and can be computed independently. Whilst floating point performance is key for a lot of computational simulation programs, they are also often limited by the performance of other parts of the machine, such as the speed of memory or the bandwidth and latency of the network.
HPL is generally able to achieve a reasonable fraction of peak performance on a large HPC system (on ARCHER, for instance, it achieves around 63% of the theoretical peak performance of the machine), but most programs that use large HPC systems don't come anywhere near that: reaching 10-15% of the peak performance of the machine would be considered good performance for a simulation code. Therefore, whilst HPL is good at characterising the raw floating point performance of a system, it often does not give a good reflection of the performance that actual applications see on that machine.
High Performance Conjugate Gradients (HPCG) is a benchmark for large-scale parallel systems that aims to address, at least partially, this problem with HPL. The conjugate gradients algorithm used in the benchmark is not limited just by floating point performance: it also relies heavily on the performance of the memory system and, to a lesser extent, on the network used to connect the processors together. This means it should provide a performance assessment closer to what real applications could achieve on a given system than HPL does, although real applications have such a wide range of performance characteristics that no single benchmark can capture the important performance features of every application that could use the system.
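To see why conjugate gradients stresses the memory system, here is a minimal sketch of the basic CG iteration. This is illustrative only: the real HPCG benchmark solves a preconditioned system on a distributed 3D grid with a 27-point stencil and a multigrid preconditioner, and the function and variable names below are our own. The point is that each iteration is dominated by a sparse matrix-vector product plus a few vector updates, all of which stream through memory rather than reusing data in cache.

```python
# A minimal sketch of the conjugate gradients (CG) iteration that HPCG is
# built around. Names here (cg, apply_A, poisson_1d) are illustrative,
# not taken from the benchmark itself.

def cg(apply_A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A.

    apply_A: a function computing the sparse matrix-vector product A @ v.
    That product, plus the vector updates below, touches every entry of
    its operands once per iteration -- which is why CG throughput is set
    by memory bandwidth rather than raw floating point rate.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)          # residual r = b - A x (x starts at zero)
    p = list(r)          # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = apply_A(p)  # sparse mat-vec: the memory-bound kernel
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new ** 0.5 < tol:
            break
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Toy operator: a 1D Poisson (tridiagonal) matrix, a much-simplified
# stand-in for HPCG's 3D 27-point stencil.
def poisson_1d(v):
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]

x = cg(poisson_1d, [1.0] * 10)
```

Note that the mat-vec does only a handful of flops per matrix entry loaded from memory, in contrast to HPL's dense factorisation, which performs many flops per byte moved.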
HPCG and ARCHER
We recently ran the HPCG benchmark on ARCHER, and in the latest version of the HPCG list (released at the same time as the Top500), ARCHER came out as the 10th fastest machine with an HPCG performance of 0.0808 PFlop/s (around 5% of the HPL performance). It should be noted that the HPCG list is much smaller than the Top500 list, with only one tenth of the entries, but it does have some of the largest machines in the world on it. Furthermore, we actually managed to beat a larger comparable machine (the Edison system, which is a similar type of system to ARCHER but has around 10% more processors and is number 34 in the Top500 list), swapping positions when compared to the Top500 list. This is probably because we spent some time tuning our HPCG run to ensure we had maximum performance; we were using a tuned version of the HPCG benchmark provided by Intel.
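The gap between the two benchmarks is easy to check from the figures quoted in this post:

```python
# ARCHER figures quoted above, in PFlop/s.
rpeak = 2.55    # theoretical peak (Rpeak)
hpl = 1.6       # measured HPL result (Rmax)
hpcg = 0.0808   # measured HPCG result

hpl_eff = hpl / rpeak    # ~63% of peak
hpcg_vs_hpl = hpcg / hpl # ~5% of the HPL result
print(f"HPL efficiency: {hpl_eff:.0%}, HPCG vs HPL: {hpcg_vs_hpl:.1%}")
```

So HPCG lands at roughly 3% of theoretical peak, much closer to the 10-15% that real, well-performing applications achieve than HPL's 63%.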
Whether HPCG overtakes HPL as the benchmark of choice for large-scale systems remains to be seen, but it will be interesting to watch how measuring and evaluating these systems progresses over the next few years. Indeed, HPCG isn't the only alternative measure of performance. For a number of years the Green500 list has used a similar approach to the Top500 benchmark, but ranked computers by power efficiency on the benchmark (including things like the power required to cool the supercomputer, as well as the power required to run the compute nodes and network) rather than by overall performance. There is also the Graph500 list, which uses a different set of benchmarks, primarily aimed at data-intensive workloads (rather than compute-intensive operations like HPCG and HPL), to assess the performance of machines on operations that require searching large datasets or graphs.
Also, we tend to use a suite of application benchmarks, such as the PRACE benchmarking suite, when procuring HPC systems or running acceptance tests, to ensure that the performance of a machine will match what users require. This enables tuning of the benchmarking to the workloads that a specific machine is likely to encounter (we have good information on the codes that are generally run on a system like ARCHER, so we can tailor our benchmarks to ensure the majority of users get reasonable performance). However, the drawback of this approach for a global ranking effort (such as the Top500 list) is that it does not produce a single number by which you can rank systems, hence the popularity of benchmarks like HPCG and HPL. One option for future benchmarking efforts could be to produce a "price comparison"-style website that lets you rank machines on how they perform on a range of different benchmarks or applications, enabling people to evaluate which machine would be best for their specific application, although this would increase the number of benchmarks you'd need to run on HPC systems.
Adrian Jackson, EPCC