Are we underestimating the real challenge of Exascale?
Posted: 28 Jun 2013 | 10:13
Reaching the Exascale is rightly posed as a combination of challenges related to (i) energy efficiency, (ii) heterogeneity and resiliency of computation, storage and communication, and (iii) the scale of the parallelism involved. Many discussions about Exascale focus on the first two challenges. This is understandable – building an Exascale system with today’s most energy-efficient technology would still require around 480 MWatts. Scaling the core counts of Titan and Sequoia to the Exascale gives 32 million and 96 million cores respectively, although the Titan figure of course ignores the GPGPU cores. This is almost a hundred-fold increase in parallelism, with all the inherent problems that will bring. Building a resilient machine with acceptable power demands is therefore impossible today.
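The scaling exercise behind those core counts can be sketched in a few lines. The Rmax and core-count figures below are my own assumptions, taken approximately from Top500 lists of the period (the Titan count is the Top500 figure, which does not include the fine-grained GPGPU cores); they are illustrative, not numbers given in the post.

```python
# Naive linear scaling: how many cores would a 1 EFLOP/s machine need
# if it were simply a scaled-up Titan or Sequoia?
EXAFLOP_IN_PFLOPS = 1000.0  # 1 EFLOP/s expressed in PFLOP/s

systems = {
    # name: (approx. Linpack Rmax in PFLOP/s, Top500 core count) - assumed figures
    "Titan":   (17.59, 560_640),     # Top500 count; GPGPU cores not included
    "Sequoia": (16.32, 1_572_864),
}

for name, (rmax_pf, cores) in systems.items():
    factor = EXAFLOP_IN_PFLOPS / rmax_pf  # speed-up needed to reach an Exaflop
    print(f"{name}: x{factor:.0f} -> ~{factor * cores / 1e6:.0f} million cores")
```

Running this reproduces the figures quoted above: roughly a 57–61x scale-up, giving around 32 million cores for a Titan-like design and 96 million for a Sequoia-like one.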
The acceptable power envelope for the first Exascale systems was originally assumed to be 20 MWatts. In recent years many experts have come to see this as a very aggressive target, and 50-60 MWatts is now viewed as more realistic. There are many technology options for reaching this target, but most involve slower processors and slower memory coupled with aggressive power-management strategies. In reality this will almost certainly mean much higher core counts than the simple scaling exercise above indicates. Indeed, if total parallel threads are considered (including all layers of heterogeneous parallelism, from accelerator cores upwards), we may have to develop applications for up to one billion parallel threads.
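A quick calculation shows why even the relaxed envelope is demanding. The 480 MWatt figure above corresponds to roughly 2 GFLOPS per Watt; the sketch below, using only the power figures from the text, computes the sustained efficiency each envelope implies for a 1 EFLOP/s machine.

```python
# Efficiency (GFLOPS/W) required to deliver 1 EFLOP/s within a given power budget.
def gflops_per_watt(power_mw):
    """Sustained GFLOPS per Watt needed for 1e18 FLOP/s within power_mw megawatts."""
    return 1e18 / (power_mw * 1e6) / 1e9

for mw in (480, 60, 50, 20):
    print(f"{mw} MW envelope -> {gflops_per_watt(mw):.1f} GFLOPS/W required")
# 480 MW needs ~2.1 GFLOPS/W (today's best); 20 MW demands 50 GFLOPS/W,
# i.e. roughly a 24-fold improvement in energy efficiency.
```

In other words, the original 20 MWatt target asks for about 25 times the energy efficiency of the best technology available at the time of writing, which is why slower, more numerous cores look unavoidable.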
Software for Exascale
The CRESTA FP7 project, which I lead, is considering the software aspects of the Exascale challenge. We split the software for Exascale into software applications – the actual modelling and simulation we want to do on Exascale systems – and what we call systemware: the operating system software, tools, compilers, debuggers, libraries, etc., with which applications will have to work on these machines. CRESTA is focussing its work on a small set of six HPC applications that are widely used today and represent the sort of codes that will have to run on Exascale systems. Over the past 20 years we have managed to cope with each new generation of hardware by incrementally improving our codes. Solvers have been optimised; tweaks to numerical performance and communication models have been made. But mostly, codes – and coders – have coped. It would be unfair not to note at this point that the Petascale is still out of the grasp of many codes, although plenty execute happily at the 100 Teraflop scale.
I believe that the scaling problems we have seen with many codes at the Petascale become insurmountable if we take the same incremental approach at the Exascale. Looking at the CRESTA codes, it is highly unlikely that any of them will reach the Exascale through incremental improvements, even allowing for weak scaling (through increased resolution of the model under study). This means we need to think about disruptive changes to codes in order to meet the challenge.
Moving modelling and simulation forward
However, simply changing a solver or making some other disruptive change to an existing code will not be enough. We simply do not understand how to compute using one billion parallel threads (except perhaps in trivial cases). Meeting the challenge requires us to completely rethink how we simulate our physical world using this much parallelism. The problem goes to the foundations of modern modelling and simulation – we need to think beyond the tools we have today and invent new methods of expressing the mathematical descriptions of the physical world around us on these and even larger systems in the future. Only by doing this will we move modelling and simulation forward for the next 20 years. This is the real challenge we face at the Exascale.
This post originally appeared in the ISC'13 blog on the ISC events website.