EPCC’s ARM system: comparing the performance of MPI implementations
MVAPICH is a high-performance implementation of MPI, specialised for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE interconnects. Even so, people generally just use whatever MPI module is loaded by default on their system. This matters because, as HPC programmers, we often optimise our codes but overlook the potential performance gains from a better choice of MPI implementation.
ARM in HPC is still very new, and the MVAPICH team has not yet fully tuned their library for the architecture. This matters because MVAPICH contains many specialised algorithms suited to different situations, and the heuristics that select between them likely need further refinement for ARM. Bearing this in mind, it is impressive that MVAPICH already performs so well.
EPCC installed a new ARM-based system called Fulhame earlier this year. With 64 nodes, each containing two 32-core Marvell ThunderX2 CPUs (4096 cores in total), this system is funded under the Catalyst UK programme to further develop the ARM software ecosystem for HPC. MVAPICH's performance has been benchmarked and explored at length on traditional x86 HPC systems, but I was interested to see how it would perform on Fulhame.
The benchmarks highlighted some very interesting patterns, and overall MVAPICH performed very competitively against the other implementations. On Cirrus, our x86 system, for instance, it consistently outperformed MPT, in some cases very significantly. On Fulhame the performance patterns were more nuanced and complex. MVAPICH demonstrated significant benefits for 2D pencil-decomposed FFT codes, because its AlltoAll collective substantially outperforms those of OpenMPI and MPT. In other situations OpenMPI or MPT performed slightly better, but OpenMPI must first be configured to select the correct communication protocol before its results come close to those of MVAPICH, which provides good performance out of the box.
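To make the AlltoAll pattern concrete, the sketch below shows the transpose-style exchange that dominates a pencil-decomposed FFT, where every rank exchanges a block of its local data with every other rank between the 1D transforms along each axis. This is a minimal illustration rather than the benchmark code used in this work; the block size and the single timed call are illustrative assumptions.

```c
/* Minimal sketch of the communication pattern that dominates a 2D
 * pencil-decomposed FFT: each rank exchanges a fixed-size block of
 * data with every other rank via MPI_Alltoall. Buffer sizes here are
 * illustrative, not taken from the Fulhame benchmarks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int block = 4096;  /* doubles sent to each rank (assumed) */
    double *sendbuf = malloc((size_t)block * size * sizeof(double));
    double *recvbuf = malloc((size_t)block * size * sizeof(double));
    for (int i = 0; i < block * size; i++)
        sendbuf[i] = (double)rank;

    /* Time the transpose-style exchange; in a real FFT this occurs
     * between the 1D transforms along each axis of the pencil. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_Alltoall(sendbuf, block, MPI_DOUBLE,
                 recvbuf, block, MPI_DOUBLE, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("MPI_Alltoall of %d doubles per rank pair took %f s\n",
               block, t1 - t0);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Compiling this with each implementation's `mpicc` wrapper and running it at scale gives a quick feel for how the AlltoAll algorithms compare. With OpenMPI in particular, the communication behaviour can be influenced through its MCA parameters (for example, selecting the UCX messaging layer with `mpirun --mca pml ucx`), which is the kind of configuration referred to above.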
The MVAPICH team at Ohio State University, USA, has been given access to Fulhame to explore the performance properties of MVAPICH on this system and on ARM in general. As the default performance was already very good, I am excited to see what can be achieved with this tuning. I believe MVAPICH is a good choice of MPI implementation, and definitely worth using. The use of ARM in HPC is also a very exciting prospect: throughout my work with this system I found that not only did it perform well, but it also provided a stable and mature HPC ecosystem. I expect we will see significant adoption of ARM-based systems (likely with MVAPICH installed!) across the Top500 in the next few years.