Software optimisation papers for SC13
Posted: 23 May 2013 | 17:00
I've just finished working on two papers for this year's Supercomputing conference, SC13, which takes place in Denver, Colorado, from the 17th-22nd November. EPCC will have an official presence, with a booth on the exhibition floor and a number of staff participating in the technical and education programmes. I thought this was a good opportunity to write up some recent work I've been undertaking.
Figure 1: Power consumption of the code across a range of systems when increasing the number of nodes used. The dotted lines are the best results from the optimised MPI code.
The first paper outlines a collaboration with Dr M. Sergio Campobasso, in which we have been working to optimise a CFD code called COSA. COSA is primarily designed for frequency-domain Navier-Stokes simulations, as opposed to the traditional time-domain approach used by most CFD codes, and as such implements a harmonic balance CFD solver (alongside traditional time-domain and steady-state solvers, to enable comparison between the different methods of solving the Navier-Stokes equations).
The code has both MPI-based and hybrid (mixed MPI/OpenMP) parallelisations to enable users to complete simulations quickly. However, there were a number of inefficiencies in both of these parallelisations which we have addressed, including coalescing MPI point-to-point and collective messages, combining MPI-I/O operations, hoisting OpenMP routines, and ensuring that the memory initialisation of the OpenMP parallelisation is optimal. We evaluated these optimisations on a range of different HPC systems, including HECToR (the UK national HPC machine, which is hosted by EPCC at Edinburgh) and a large BlueGene/Q system.
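To illustrate the message-coalescing idea, here is a minimal sketch in Python. It is not the COSA implementation (which is an MPI code), and the `pack_messages`/`unpack_messages` helpers are hypothetical stand-ins for MPI pack/unpack operations: the point is that several small payloads are serialised into one contiguous buffer so a single send replaces many latency-bound sends.

```python
import struct

def pack_messages(arrays):
    """Serialise several float lists into one contiguous buffer,
    prefixing each with its length so the receiver can split them.
    Stand-in for coalescing many small MPI sends into one."""
    buf = bytearray()
    for a in arrays:
        buf += struct.pack("i", len(a))          # payload length header
        buf += struct.pack(f"{len(a)}d", *a)     # payload itself
    return bytes(buf)

def unpack_messages(buf):
    """Recover the individual payloads from the coalesced buffer."""
    arrays, offset = [], 0
    while offset < len(buf):
        (n,) = struct.unpack_from("i", buf, offset)
        offset += struct.calcsize("i")
        arrays.append(list(struct.unpack_from(f"{n}d", buf, offset)))
        offset += n * struct.calcsize("d")
    return arrays

# Three small halo-exchange payloads become one buffer, one send.
halos = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]
assert unpack_messages(pack_messages(halos)) == halos
```

The saving comes from paying the per-message latency cost once rather than once per payload, which is why coalescing helps most when messages are small.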
One of the most interesting parts of the work was evaluating the relative performance of the code across the systems, which cannot be done using just the overall time to solution for a given problem (as the hardware of each system is different). Instead we used a measure of the nominal power per iteration of the simulation (nominal because it is based on each system's reported power consumption when running the LINPACK benchmark, from which we calculate the watts per node). As shown in Figure 1, the BlueGene/Q system provides a much better power-performance trade-off than the other systems, and the hybrid parallelisation enables improved utilisation of the BlueGene/Q hardware. This allows us to reduce the runtime by a factor of around 3.5 compared to the MPI code for the same power consumption. For further technical details, see the pre-print of the full paper.
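The metric itself is simple arithmetic, sketched below. The numbers are made up purely for illustration (they are not the measured COSA results), but they show how per-node LINPACK power, node count, and per-iteration runtime combine into a figure that is comparable across machines with different hardware.

```python
def nominal_power_per_iteration(watts_per_node, nodes, time_per_iter_s):
    """Nominal energy (joules) consumed per simulation iteration:
    per-node power (derived from LINPACK runs) x nodes x iteration time."""
    return watts_per_node * nodes * time_per_iter_s

# Illustrative, made-up numbers: system A is faster per iteration but
# draws far more power per node; the metric makes them comparable.
e_a = nominal_power_per_iteration(300.0, 64, 2.0)   # 38400 J per iteration
e_b = nominal_power_per_iteration(80.0, 256, 1.5)   # 30720 J per iteration
assert e_b < e_a  # B does more work per joule despite using more nodes
```

In the paper this is what lets a low-power, high-node-count machine such as a BlueGene/Q be compared fairly against conventional clusters.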
The second paper, undertaken in collaboration with Pär Strand of Chalmers University, outlines research into methods of optimising communications for exascale computers. MDMP (Managed Data Message Passing) is a new parallel programming approach that aims to provide users with an easy way to add parallelism to programs, to optimise the message-passing costs of traditional scientific simulation algorithms, and to enable existing MPI-based parallel programs to be optimised and extended without requiring the whole code to be rewritten from scratch. MDMP takes a directives-based approach: users specify what communications should take place in the code, and MDMP then implements those communications in an optimal manner, using both the information provided by the user and data collected by instrumenting the code and gathering information at runtime about the data to be communicated. We demonstrate that, for certain message patterns, our approach can provide optimised communications when only a small amount of data is being sent, with the benefits increasing with message size. A pre-print of this submitted paper is available.
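The managed-communication idea can be sketched in a few lines. To be clear, this is not MDMP's actual directive syntax or runtime (see the paper for that); the `ManagedComms` class is a hypothetical illustration of the underlying concept: the user declares communication intents, and a runtime layer that can see the whole set of pending communications services them together, applying optimisations such as merging messages to the same destination.

```python
class ManagedComms:
    """Hypothetical sketch of a managed-communication runtime:
    records declared sends instead of performing them immediately,
    then services them as a batch. Not MDMP's real interface."""

    def __init__(self):
        self.pending = []  # recorded communication intents

    def declare_send(self, dest, data):
        """Record a send rather than performing it straight away,
        as a user directive might instruct the runtime to do."""
        self.pending.append((dest, list(data)))

    def flush(self):
        """Service all recorded sends at once, merging messages that
        share a destination -- the kind of optimisation a runtime can
        apply once it can inspect the full communication pattern."""
        merged = {}
        for dest, data in self.pending:
            merged.setdefault(dest, []).extend(data)
        self.pending.clear()
        return merged  # destination rank -> one coalesced payload

comms = ManagedComms()
comms.declare_send(1, [1.0, 2.0])
comms.declare_send(2, [3.0])
comms.declare_send(1, [4.0])  # merged with the earlier send to rank 1
assert comms.flush() == {1: [1.0, 2.0, 4.0], 2: [3.0]}
```

Deferring the communications is what makes the optimisation possible: individual MPI calls issued eagerly give the library no chance to combine them.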