ARCHER code developers and presenting performance

 

Application performance

 

As part of the ARCHER Knights Landing (KNL) processor testbed, we have produced and collected a set of benchmark reports on the performance of various scientific applications on the system. This has involved the ARCHER CSE team, EPCC's Intel Parallel Computing Center (IPCC) team, and various users of the system all benchmarking and documenting the performance they have experienced. 

The benchmark reports make interesting reading, but this blog post isn't about the performance people have seen (I'll blog about that shortly).

What was interesting when reading the reports was the variation, both in application areas and in the ways people have chosen to present and discuss their performance results. These reports are by no means a perfect cross-section of the ARCHER user community or of application usage on the system; they largely represent the people who responded to our requests for benchmarking data, and those actively engaged in evaluating the KNL processors.

Scientific areas

If you compare the breakdown of applications from the performance reports (6 CFD, 4 molecular dynamics/materials, 1 dynamics modelling, 1 statistics package, and 1 plasma modelling code) to the ARCHER usage data, it becomes apparent that CFD is over-represented compared to its usage on the main ARCHER system, while the materials science and climate/ocean modelling communities are under-represented.

However, as pointed out by my colleague Andy Turner, this difference is likely down to the split between users and developers on ARCHER. The majority of ARCHER users are not code developers; they use pre-installed applications and packages. On ARCHER, it looks like the developer community is much more active in CFD than in other scientific areas.

Of course, it could be that developers from the other scientific areas have access to KNL resources elsewhere and are doing their benchmarking and development there, but given there is currently no charge for using the ARCHER KNLs, and we have a reasonable number available for testing, I would expect a good selection of developers to exploit this testbed.

Why CFD appears to have more active developer-users on a system like ARCHER is less clear. Maybe CFD codes are newer than some of the others on ARCHER, so more areas are under active development. Maybe the involvement of industry in a wide range of CFD areas makes it more attractive to develop codes for CFD applications? Or maybe the engineering focus of CFD lends itself to more users developing their own codes?

Performance presentation

Another interesting aspect of the reports is the range of ways performance has been presented, and the different aspects of performance that have been investigated, even in this relatively small selection of benchmarking reports. We didn't put any restrictions on how performance should be presented, or provide guidelines for how we wanted it to be represented and discussed, other than asking that the raw timing data be provided in a table alongside any graphical representation, so that readers can see the source data behind each report.

Given we didn't specify a format, you would expect some variation, but as they are all performance/benchmarking reports you would also expect some commonality between them. However, there are very few similarities in how performance is presented across the reports. You get examples of standard timing graphs as the code scales to larger numbers of cores (e.g. Figure 1), but also discussions of the number of iterations per second, or the number of timesteps per time period (e.g. Figure 2).

Both of these essentially tell the same story, but with reversed scales (on timing graphs, lower is better as you add cores; for iterations per time period, higher is better). Then you get normalised performance (Figure 3), comparing the KNL with standard processors on a per-core or per-node basis (node-to-node is the way to go, from my perspective), comparisons of the energy and power used by the two systems, speedup, hyperthreading vs no hyperthreading, different compilers, and so on.
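To make the reverse-scale point concrete, here is a minimal Python sketch (with purely illustrative, made-up timings; the function names are my own, not taken from any of the reports) showing that runtime and iterations per second are just reciprocal views of the same measurements, and how a simple node-to-node speedup is derived from the same numbers.

# Minimal sketch: runtime and iterations/second are reciprocal views of the
# same data, and speedup is just a ratio of runtimes.
# All numbers below are made-up placeholders, not measured results.

def iterations_per_second(runtime_s, n_iterations):
    """Convert a total runtime into an iterations-per-second rate."""
    return n_iterations / runtime_s

def speedup(reference_runtime_s, runtime_s):
    """Classic speedup: reference runtime divided by measured runtime."""
    return reference_runtime_s / runtime_s

# Hypothetical runtimes (seconds) for a fixed problem on 1, 2, 4 and 8 nodes.
nodes = [1, 2, 4, 8]
runtimes = [800.0, 420.0, 230.0, 140.0]
n_iterations = 1000

for n, t in zip(nodes, runtimes):
    rate = iterations_per_second(t, n_iterations)
    s = speedup(runtimes[0], t)
    # Runtime: lower is better; rate: higher is better; both tell one story.
    print(f"{n} nodes: {t:7.1f} s  {rate:6.2f} iter/s  speedup {s:4.2f}")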

This just shows that, even for basic performance evaluations, there is an extremely wide range of parameters that can affect performance and that people therefore want to investigate, and consequently there are many different ways to present the results.

However, I do have a bit of a desire for a more standardised set of ways to present performance. While most of the graphs and pictures in the reports are readily understandable, the wide variety of presentation approaches, both in these reports and in the wider benchmarking and parallel computing literature, does make it hard to pick up a paper or report and understand it straight away.

It is also possible to hide some nasty details by choosing to present results in a particular way (speed-up graphs are a common culprit for these types of performance crime). So, if you're going to write about the performance of a code, or some benchmarking work you've been doing, my advice is to keep it simple. A basic runtime vs core count graph, or runtime vs different benchmark options, will likely present all the details you need your audience to grasp without being confusing or potentially misleading. They may not be the most exciting graphs around, but simpler is better, at least in my opinion.
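As an illustration of that advice, here is a short matplotlib sketch of the kind of simple runtime-vs-core-count plot I have in mind, again with made-up placeholder timings rather than real benchmark data, and with the raw numbers printed as a plain table alongside the graph (as we asked the report authors to do).

# Sketch of a simple runtime vs core count plot; placeholder data only.
import matplotlib.pyplot as plt

cores = [64, 128, 256, 512, 1024]
runtimes_s = [950.0, 510.0, 290.0, 180.0, 130.0]  # illustrative timings

fig, ax = plt.subplots()
ax.plot(cores, runtimes_s, marker="o")
ax.set_xscale("log", base=2)   # core counts double each step, so a log2 axis
ax.set_yscale("log")           # log runtime makes the scaling trend clearer
ax.set_xlabel("Number of cores")
ax.set_ylabel("Runtime (seconds)")
ax.set_title("Benchmark runtime vs core count (illustrative data)")
fig.savefig("runtime_vs_cores.png", dpi=150)

# Emit the raw timings as a plain table alongside the graph.
print("cores  runtime_s")
for c, t in zip(cores, runtimes_s):
    print(f"{c:5d}  {t:8.1f}")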

As always, I'm interested in your opinions on this, so don't hesitate to get in touch if you disagree, or even if you agree. It's always interesting to get a wide range of views on things like this.