The Power Benchmarking Game
Posted: 2 Sep 2016 | 14:12
As we continue to scale our HPC systems, the energy cost of doing so becomes an increasingly large and potentially limiting factor.
One of the most important aspects of addressing the challenges of energy efficient computing is having a solid understanding of how various design choices, in both software and hardware, affect the overall energy usage of your system and application.
In his guest post, PhD student Blair Archibald discusses how the Adept project is contributing to knowledge in this area.
The Adept Project, coordinated by EPCC, attempts to add to the current understanding of energy efficient computing and address the various trade-offs between different design choices. By creating custom power-measurement hardware capable of sampling the instantaneous power at a rate of up to 1 MHz, we can get highly detailed traces of the power profile of an application as it runs.
By comparing implementations, we can experiment with different design choices and see how they affect the overall power consumption of the system. Knowledge gained from these experiments can then be used to build a tool which predicts the (relative) energy requirements of an unknown application on a given system (potentially even one that doesn't exist yet!).
Some of the design choices which have been investigated so far in Adept are:
- Effects of Dynamic Voltage Frequency Scaling (DVFS)
- An energy/power comparison of parallel programming models, e.g. OpenMP, MPI and vectorisation
- System choices: processor, out-of-order execution effects, etc.
In this blog post we take the view of an applications programmer and set out to discover what effects the choice of programming language might have on the energy requirements of an application.
Adept hardware in the lab.
Programming languages benchmarks
Comparing programming languages is an extremely difficult task. What criteria can we use to say language X is better than language Y? Empirically we might say "language X computed the result in half the time it took language Y, so it is the better language". But this fails to account for a host of other factors, for example the time needed to develop and tune the application, or personal preference of programming language.
Empirical programming language comparisons have been done before. The Programming Languages Benchmarking Game (PLBMG) provides a set of toy benchmark problems written in various programming languages (parallelism may be used). This allows a Top Trumps-style tournament to take place by comparing implementations on a set of empirical criteria. The original PLBMG compares implementations by time to completion, memory usage, code size and CPU loads. Can we use this to tell us anything about the power/energy usage patterns of the different languages?
Using the Adept tooling, we have instrumented the following subset of the PLBMG benchmarks and a mix of compiled, just-in-time (JIT) compiled and interpreted languages:
The PLBMG can also be used to compare multiple implementations of an application written in the same programming language. In the results which follow, a tag of C - 1 means we are considering implementation number 1 from the set of C programming language implementations.
The results below show the power consumption on a system with a quad-core Intel i5 4670K processor (at a fixed 3.4 GHz clock frequency), 16 GB of RAM and running CentOS 6. The hardware is instrumented to give power measurements for the CPU socket, DRAM (two sticks) and ATX power supply lines (12 V, 5 V and 3.3 V).
Energy usage by language
To compare languages we combine data points across the full power trace in order to calculate the energy usage (by numerical integration). Each benchmark is run 5 times and the mean value of the test statistic is taken as a single data point. To ensure the suite runs in a reasonable time, an implementation that runs for more than 400 s is excluded from the results.
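As a rough illustration of the integration step, the sketch below turns a list of evenly spaced power samples into an energy figure using the trapezoidal rule, then summarises several runs by mean and standard deviation. The function names, the choice of the trapezoidal rule and the use of the sample (n - 1) standard deviation are assumptions made for this example; the post does not say which integration scheme or estimator the Adept tooling uses.

```python
from statistics import mean, stdev


def energy_joules(power_watts, sample_rate_hz):
    """Integrate evenly spaced power samples (W) over time to get energy (J).

    Uses the trapezoidal rule; this is an assumed scheme, not necessarily
    the one used by the Adept tooling.
    """
    if len(power_watts) < 2:
        return 0.0
    dt = 1.0 / sample_rate_hz  # seconds between consecutive samples
    ends = 0.5 * (power_watts[0] + power_watts[-1])
    inner = sum(power_watts[1:-1])
    return dt * (ends + inner)


def summarise_runs(traces, sample_rate_hz):
    """Return (mean energy, sample standard deviation) over repeated runs."""
    energies = [energy_joules(t, sample_rate_hz) for t in traces]
    return mean(energies), stdev(energies)
```

For example, a constant 10 W trace of five samples taken at 4 Hz spans one second, so `energy_joules([10.0] * 5, 4)` gives 10 J.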
The following graphs show the runtime and energy usage for each language across each of our four benchmarks. Error bars represent the standard deviation from the sample mean.
As we might expect, given that energy is power integrated over time, most results show that the energy usage tracks the runtime. Some cases, however, are more surprising. For example, in the case of binarytrees the python3 implementation has a lower runtime than the ruby versions but uses more energy overall. Likewise, in spectralnorm the runtime for perl-4 is very similar to the racket implementations, but the energy usage is much higher.
The DRAM energy usage across the benchmarks is generally small. This is somewhat expected given that the benchmarks are compute heavy rather than memory bound and the fact that DRAM power is much lower (by hardware design) than the other components.
Another surprising result is that Java looks to perform (in runtime and energy usage) similarly to fully compiled languages such as C and C++. We might have expected the JIT compiler (due to JIT warmup) to increase the energy usage significantly, but this does not seem to be the case.
To see this more clearly we zoom into the mandelbrot case by considering only the C, C++ and Java cases. By using the Adept hardware to measure an idle system (only the OS and background processes running), we have also removed the idle energy from the traces to show the CPU energy overheads of the implementation. As before, the error bars represent the standard deviation from the mean.
The results show the same trend as before, with energy usage following runtime. At this enlarged scale we can see that random energy/power fluctuations across runs are relatively uncommon (small deviation from the mean). This is surprising given the dynamic nature of modern computer systems (e.g. process migration, operating system events, random interrupts, etc.).
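One simple way to remove the idle contribution is to estimate the average idle power from a separate idle measurement and subtract the energy the idle system would have used over the same runtime. The sketch below assumes exactly that; the function name and the constant-baseline model are illustrative assumptions, and the post does not describe how the subtraction was actually performed.

```python
from statistics import mean


def overhead_energy(total_energy_j, runtime_s, idle_trace_w):
    """Energy (J) used above an assumed constant idle baseline.

    idle_trace_w holds power samples (W) from an idle system; its mean is
    treated as the baseline power for the whole benchmark run.
    """
    idle_power_w = mean(idle_trace_w)
    return total_energy_j - idle_power_w * runtime_s
```

With a 5 W idle baseline, a run that used 300 J in 10 s would be credited with an overhead of 250 J.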
By comparing different implementations in the same language, it becomes clear that how you write your implementation can have huge implications for energy usage. In some cases this is much more important than language choice. For example, Java - 2 uses less energy than C - 2, even though the C implementations are, in general, more energy efficient than Java.
Detailed power traces
The Adept measurement hardware doesn't just let us get combined statistics for application runs but also allows us to study the full, high precision, power traces of a running application.
In the binarytrees results above we saw the interesting effect that the python implementation ran in less time than the ruby versions but the energy usage was much higher. To figure out why this might be the case we can study the full power traces.
The following graphs show the full power traces of the python and ruby-1 implementations for the binarytrees benchmark. Rather than plotting every individual power measurement (in this case we use a 500 kHz sample rate, which generates a lot of data points per run!) we use a moving average to apply a basic form of smoothing to the signal to make it easier to see what is going on:
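A moving average of this kind can be sketched as follows. The window length and the function name are assumptions for illustration (the post does not give the window used), and the running sum keeps the cost linear in the number of samples, which matters at 500,000 samples per second.

```python
from collections import deque


def moving_average(samples, window):
    """Simple moving average over a fixed-size window.

    Returns one value per full window, so the output has
    len(samples) - window + 1 entries (or none if the trace is too short).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    buf = deque()
    running = 0.0
    for s in samples:
        buf.append(s)
        running += s
        if len(buf) > window:
            running -= buf.popleft()  # drop the sample that left the window
        if len(buf) == window:
            smoothed.append(running / window)
    return smoothed
```

For instance, `moving_average([1, 2, 3, 4, 5], 3)` returns `[2.0, 3.0, 4.0]`.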
We see Python quickly jumps to using a whopping 60W of CPU power for most of the run, lowering very slightly towards the end. Without looking at the code, this sort of profile is indicative that parallelism is being used:
- The initial jump to 60W is the worker spawn (likely to be processes in the python case)
- Work is then performed on all the workers in parallel, showing the high power usage.
- Nearer the end of the computation work becomes sparse and some of the workers start finishing/sleeping, hence the dip in the power profile.
The ruby case (at this scale) tells us much less about the application other than the fact it can maintain near constant (average) power usage for the entire run. Compared to python it's likely to be using a single core only. The compute heavy nature of the benchmark may also be seen by noting that there are no long power dips as we might expect when fetching large amounts of data from memory.
Checking the source code of these two implementations confirms that this is indeed the case: Python uses parallel processes and the Ruby version does not. It's amazing how much you can tell about an implementation just by considering power profiles.
These runtime vs. energy trade-offs (e.g. should I use parallelism or not, and if so, how much should I use?) are at the heart of the energy efficiency debate, and in many cases the outcome is difficult to predict until you physically measure the power profile of the application. Cases such as this show that the argument that faster implies better energy efficiency is false.
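The arithmetic behind this is simple: energy is average power multiplied by runtime, so a faster run only saves energy if its power draw does not rise by more than its runtime falls. The sketch below uses made-up numbers to show the effect. Only the 60 W figure echoes the python trace described above; the runtimes and the 25 W single-core figure are hypothetical and are not measurements from this study.

```python
def energy_j(avg_power_w, runtime_s):
    """Energy (J) as average power (W) multiplied by runtime (s)."""
    return avg_power_w * runtime_s


def faster_uses_less_energy(p_fast_w, t_fast_s, p_slow_w, t_slow_s):
    """True only if the faster run also consumes less energy."""
    return energy_j(p_fast_w, t_fast_s) < energy_j(p_slow_w, t_slow_s)


# Hypothetical figures for illustration only.
parallel = energy_j(60, 20)    # 60 W for 20 s
sequential = energy_j(25, 40)  # 25 W for 40 s
```

Here the parallel run finishes in half the time but uses 1200 J against 1000 J for the sequential run, so `faster_uses_less_energy(60, 20, 25, 40)` is `False`.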
Using the Adept power measurement tool it is possible to gather data on how various design choices within a computer system affect the overall energy usage. Here we asked what effect language choice might have on energy usage when running a set of test applications. Although it is not appropriate to draw conclusions about which language is better based on empirical measures alone, the results indicate that compiled and JIT-compiled languages tend to have the most energy-efficient implementations.
So which language should you use for energy-efficient computing? To misquote the Programming Languages Benchmarking Game website:
Will your toy benchmark program be [faster] more energy efficient if you write it in a different programming language? It depends how you write it!
Blair Archibald, EPCC intern
Blair is a PhD student in the Glasgow Systems Section (GLASS) at the University of Glasgow where he is researching high performance irregular parallelism. He is currently looking at case studies in computational algebra and combinatorial search as examples of these types of workloads. As part of his PhD, he is working with the Adept and Exaflow projects at EPCC to gain further experience in HPC and explore different application domains.
Top image: ThamKC, iStock