Accelerating cloud physics and atmospheric models using GPUs, KNLs and FPGAs
Posted: 24 Apr 2019 | 11:51
The blog post below is based on the abstract of a talk at the PASC mini-symposium 'Modelling Cloud Physics: Preparing for Exascale' (Zurich, 13 June 2019).
The Met Office NERC Cloud model (MONC) is an atmospheric model used throughout the weather and climate community to study clouds and turbulent flows. It is often coupled with the CASIM microphysics model, which provides the capability to investigate interactions at the millimetre scale and to study the formation and development of moisture. One of the main targets of these models is the problem of fog, which is very hard to model due to the high resolution required – for context, the main UK weather forecast resolves to 1 km, whereas the fog problem requires 1 metre or less.
A major driver here is Heathrow airport. Due to its location it is prone to fog, which requires the spacing between aircraft to be increased. Because the airport runs at 98% capacity, this increased spacing causes delays and has a financial impact, but crucially the airport is unable to accurately predict when the fog will clear. In such a situation you need not just accuracy but also a forecast that completes in a timely manner (it's useless if the results of a day's fog modelling at Heathrow are available 3 months later!).
Exascale has a significant role to play in accelerating these models and providing new and important capabilities that not only benefit scientists but also wider industry. Whilst MONC and CASIM are prime models for simulating fog, they are both heavily computationally intensive and it is important to find ways to accelerate them. We have focused on three technologies: GPUs, KNLs and FPGAs. These are not mutually exclusive as they potentially suit different parts of the code and we are exploring the potential role that each has to play.
Initially we focused on using OpenACC for GPU acceleration. In MONC we were able to isolate specific computationally intensive kernels, whereas in CASIM we were forced to offload the entirety of the microphysics model due to the tightly coupled nature of the code. While there were several caveats and lessons learnt, experiments on Piz Daint (P100 GPUs) demonstrated a significant benefit to using GPUs. The benefit was most significant for CASIM, which has the most computationally intensive aspects: using GPUs reduced its runtime six-fold compared to a Haswell CPU.
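To illustrate the kernel-isolation approach used for MONC, the sketch below shows the general shape of offloading a single stencil loop with OpenACC directives. The function name, field names and 1-D diffusion stencil are purely illustrative stand-ins, not the actual MONC code; compilers without OpenACC support simply ignore the pragmas.

```c
#include <stdlib.h>

/* Illustrative sketch only: a MONC-style computational kernel annotated
   with OpenACC. The copyin/copyout clauses make the host<->device data
   movement explicit, which is where much of the tuning effort goes. */
void diffuse_field(int n, const double *in, double *out, double coeff) {
  #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
  for (int i = 1; i < n - 1; i++) {
    /* simple 1-D diffusion stencil standing in for the real physics */
    out[i] = in[i] + coeff * (in[i - 1] - 2.0 * in[i] + in[i + 1]);
  }
}
```

Isolating a kernel like this is straightforward when the loop is self-contained; the CASIM case was harder precisely because its routines are too interdependent to annotate one loop at a time.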
We then focused on Knights Landing (KNL). This technology has been deprecated by Intel, but it is still useful to consider because of the high degree of vectorisation required to get good performance, a trend likely to continue in future generations of Xeon CPUs. KNLs have much more memory readily available to them, and we found that we were able to run much larger systems. This matters on the KNL because smaller systems performed comparably to a node of ARCHER (Ivy Bridge), but on much larger systems the KNL starts to outperform Ivy Bridge more significantly, reducing the runtime by around 40%. However, when the same experiment was run on a node of the Met Office's XC40 (Broadwell CPU) we found that this later-generation CPU outperformed the KNL quite considerably. Whilst this doesn't look good for using KNLs to accelerate this model, it is still an important result.
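The vectorisation point above can be made concrete with a sketch of the kind of loop that suits the KNL's wide (512-bit) vector units: contiguous, branch-free, per-point updates. The function and the update formula below are made up for illustration and are not taken from CASIM; the `omp simd` hint is ignored by compilers without OpenMP support.

```c
#include <stddef.h>

/* Illustrative sketch: a contiguous, branch-free per-point update of the
   shape that vectorises well on KNL. With 512-bit units, each vector
   instruction processes eight doubles, so loops like this are where the
   hardware's throughput advantage comes from. */
void scale_and_add(size_t n, double a, const double *x, double *y) {
  #pragma omp simd
  for (size_t i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}
```

Loops with strided access, branches or inter-iteration dependencies defeat this, which is why restructuring code for vectorisation was a large part of the KNL effort.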
So currently we have GPUs in the lead for accelerating the model, but there is a significant downside here – the energy consumption. GPUs require a very significant amount of energy, and this makes them difficult (and expensive) to deploy in the field. The latest piece of work we have done, as part of EXCELLERAT (the European Centre of Excellence for Engineering Applications), is to port the MONC computational kernels over to FPGAs, using High Level Synthesis, and run them on our Kintex UltraScale, which is mounted on a PCIe card. This is still work in progress, but we can demonstrate some interesting performance characteristics in comparison to the CPU, with each of our kernels currently providing roughly the floating point performance of a single Broadwell core. Undoubtedly there is more performance headroom we can exploit at the kernel level, but what is currently hitting us with this FPGA version is the cost of data transfer (DMA to and from the PCIe card), and we have a few ideas for how to ameliorate this.
The result of this work, driven by real world use-cases, is insight into the applicability of three major technologies available for accelerating HPC codes for atmospheric modelling.
Nick Brown, EPCC