Benchmarking the Oracle bare metal cloud for DiRAC HPC workloads

13 June 2020


Acknowledgements

Although I wrote this blog article, the people who did the real work are members of the DiRAC RSE team and the Oracle team.

DiRAC RSEs:

  • Michael Bareford, EPCC, University of Edinburgh
  • Alexei Borissov, University of Edinburgh
  • Arjen Tamerus, University of Cambridge

Oracle HPC Team:

  • Andy Croft
  • Stuart Leeke
  • Arnaud Froidmont

Thanks to Oracle (particularly Paul Morrissey) for arranging access to the resources.

Introduction

On-premises large-scale HPC and commercial cloud both perform similar roles for academic researchers: providing access to advanced computing capabilities that would otherwise not be available and opening up new opportunities for research. While commercial (and public) cloud offerings have achieved significant academic uptake for high-throughput and other trivially parallel research workloads, they have not yet had the same success with the tightly-coupled, distributed-memory parallel jobs that dominate the use of large-scale, on-premises HPC and supercomputing services. This has generally been due to a few factors, including:

  1. A perception (which may or may not be justified) that commercial cloud systems cannot offer the required level of performance, particularly the parallel interconnect and parallel I/O subsystem performance needed to sustain these HPC workloads.
  2. The model of use of cloud-based resources is very different from the typical user experience on an HPC system, which generally consists of remote SSH access without elevated privileges and with a large amount of supporting software (compilers, numerical libraries, etc.) pre-installed for users by expert research software engineers (RSEs).
  3. The cost of access to high-performance instances or shapes in the commercial cloud has been prohibitive when compared to the cost of access to on-premises HPC services.

Recently DiRAC and Oracle put together a short proof-of-concept project to evaluate the suitability of the Oracle bare metal cloud for DiRAC HPC applications. In particular, we were interested in addressing the three issues described above.

  1. Address issue 1 by looking at the performance of DiRAC workloads on the HPC shapes that provide MPI over the RDMA over Converged Ethernet (RoCE) interconnect now available within Oracle Cloud Infrastructure (OCI).
  2. Address issue 2, particularly for the Oracle cloud, by assessing how easy it would be for a typical DiRAC user, with the skills they already have, to use the service as it currently stands for their research.
  3. Address issue 3 by evaluating potential cost models for DiRAC use of the OCI.

In this blog post I will summarise our progress with points 1 and 2, and leave point 3 for another time (although it is, in some ways, the most important point). First, I will summarise the OCI HPC setup that we used in this project; then I will look at the DiRAC application benchmark suite. This will be followed by some comments on usability and performance before summing up at the end.

Spoilers!

I will get to the details of HPC on OCI shortly but, first, here are the highlights from the work for people who do not want to read right through the post!

  1. Performance of DiRAC multi-node HPC applications on OCI up to 32 nodes is generally good and in line with what we would expect for an Infiniband-based HPC cluster. There are variations in performance, but these are no different from what you might expect across different on-premises HPC systems.
  2. Parallel I/O performance based on BeeGFS installed over solid state storage is excellent.
  3. The use model is very different from typical HPC systems: as it currently stands, a user is expected to be comfortable with Linux system administration tasks (across multiple nodes) and happy to install their own compilers, recompile MPI, etc. This is not a good fit for typical DiRAC users. A small amount of work pre-installing software in the HPC shapes would substantially reduce the barriers to use.
    1. Oracle performed all the cluster configuration and provisioning for us. (Managing this ourselves would have presented another hurdle to access.) Note that there are tools available to help automate this and simplify this step.
  4. Currently, getting access to clusters in OCI larger than 16 nodes with the RoCE interconnect is challenging.
  5. We (DiRAC) and the Oracle team had a really constructive working relationship and gained a lot of understanding of how the HPC world looks through each other's eyes.

HPC on Oracle Cloud

The solution we used for our investigations was based on the BM.HPC2.36 HPC shape which has:

  • 2x 3.0 GHz Intel Xeon 6154 (Skylake Gold), 18 cores each (AVX2 base: 2.6 GHz, AVX-512 base: 2.1 GHz)
  • 384 GB DDR4-2666
  • 6.7 TB NVMe local storage
  • Mellanox ConnectX-5, 100 Gbps network interface cards with RDMA over Converged Ethernet (RoCE).

For our initial cluster, we were provided with 16 of these nodes with the RoCE interconnect configured, an NFS file system mounted on all nodes to hold shared data (executables, libraries, etc.), and a bastion node to allow external access to the cluster.

Towards the end of the project we got access to a similar cluster with 32 nodes rather than 16. Finally, a separate 16-node cluster was created with a parallel BeeGFS file system built on the local NVMe storage of each node, to test the parallel I/O performance of the cluster in the cloud.

All of the clusters had similar software environments. They came with:

  • Oracle Linux (derived from RHEL)
  • OpenMPI compiled for Mellanox ConnectX-5 RoCE with system GCC 4.8.5.

There was only one account provided on each cluster and this account was able to sudo to root.

DiRAC application benchmarks

DiRAC has recently proposed a set of HPC application benchmarks that cover the range of application types and research areas typical of the DiRAC research community. The benchmarks were chosen in consultation with the DiRAC community and are:

Application | Research Area | DiRAC Service | Primary Language
Grid | QCD | Extreme Scaling | C++
SWIFT | Cosmology | Memory Intensive | C
AREPO | Astrophysics | Data Intensive | C
RAMSES | Astrophysics | Data Intensive | Fortran
TROVE | Molecular spectra | Data Intensive | Fortran
sphNG | Astrophysics | Data Intensive | Fortran

Most of the benchmark applications have both strong and weak scaling cases. More details of the different application benchmarks can be found in the DiRAC benchmarking repositories.

These public repositories are still a work in progress so do not yet have the full sets of results, but we are actively working to make them as complete and useful as possible. In particular, we are trying to take an "open source" approach to benchmarking, making all the results, details, and analysis scripts available to aid the reproducibility and reuse of the work we have undertaken.

System preparation and porting

The clusters came with minimal HPC software and tools installed so, before we were able to start work on porting the DiRAC benchmarks, we needed to:

  • Arrange how we would work with a single user account and no scheduler present.

    - This setup is a symptom of the difference in how cloud resources are used compared to multi-user on-premises HPC systems.

    - (Note: with more time and system administration expertise available, we would have been able to set up additional user accounts and install and configure a scheduler.)
  • Install more up-to-date compilers…
  • … and as a consequence of updating the compilers, recompile OpenMPI.

With no scheduler installed and only a single user account, we simply allocated blocks of time to each DiRAC RSE team member to stop us from interfering with each other's work. The launch of MPI programs also had to be managed manually, using template scripts from Oracle to set up hostfiles and launch MPI processes on the correct compute nodes.
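
As a rough illustration of what this looks like in practice (a minimal sketch only: the node names, process counts and benchmark binary below are hypothetical, and the actual Oracle-supplied templates differ), launching an MPI job by hand amounts to writing a hostfile and passing it to mpirun:

    #!/usr/bin/env python3
    # Minimal sketch of a manual MPI launch in the absence of a scheduler.
    # Node names, process counts and the benchmark binary are hypothetical.
    import subprocess

    nodes = ["hpc-node-1", "hpc-node-2", "hpc-node-3", "hpc-node-4"]  # hypothetical hostnames
    procs_per_node = 36  # the BM.HPC2.36 shape exposes 36 cores per node

    # Write an OpenMPI hostfile listing each compute node and its slot count.
    with open("hostfile", "w") as f:
        for node in nodes:
            f.write(f"{node} slots={procs_per_node}\n")

    # Launch the (hypothetical) benchmark executable across all listed nodes.
    subprocess.run(
        ["mpirun", "--hostfile", "hostfile",
         "-np", str(len(nodes) * procs_per_node),
         "./benchmark_executable"],
        check=True,
    )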

We installed more modern versions of GCC (version 8, via "yum install devtoolset-8") to allow us to compile more performant versions of the DiRAC benchmarks. Having done this, the pre-installed OpenMPI, which had been compiled with the (very old) system GCC, caused problems when compiling Fortran MPI applications: the Fortran module file supplied with the MPI installation was not compatible with code compiled using GCC 8. This meant we had to rebuild OpenMPI (after obtaining the correct configure options from Oracle) using the updated GCC version. We also had to install the updated GCC on all compute nodes to make the appropriate library files available across the cluster.
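
One sanity check worth scripting at this point (a sketch on our part, not something taken from the Oracle setup; the wrapper names are the standard OpenMPI ones and the expected version string is an assumption) is to confirm that the MPI compiler wrappers now invoke the devtoolset GCC rather than the system GCC 4.8.5:

    #!/usr/bin/env python3
    # Sketch: check that the rebuilt OpenMPI compiler wrappers report GCC 8,
    # not the system GCC 4.8.5 (a mismatch is what broke the Fortran builds).
    import subprocess

    def wrapper_banner(wrapper):
        """Return the first line of the version banner printed by a compiler wrapper."""
        out = subprocess.run([wrapper, "--version"], capture_output=True, text=True, check=True)
        return out.stdout.splitlines()[0]

    for wrapper in ("mpicc", "mpifort"):
        banner = wrapper_banner(wrapper)
        print(f"{wrapper}: {banner}")
        if " 8." not in banner:  # crude version check; adjust for the GCC actually installed
            raise SystemExit(f"{wrapper} is not using GCC 8 - Fortran module files may be incompatible")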

Once the basic compilers and MPI libraries were in place, we could move on to porting the DiRAC benchmark applications. This was pretty straightforward and no different from porting to any other HPC system, other than that we generally logged into a compute node directly to do the compilation and testing work. The lack of optimised numerical libraries provided with the cluster likely means that we were not getting optimal performance out of the compute nodes. For some of the benchmark applications it has also been observed that the Intel compilers give better performance than GCC, so some performance was potentially lost because the Intel compilers were not available on OCI.

If the HPC images had come with the more up-to-date compilers and associated MPI library installed and on the default $PATH, along with a more automated way to launch MPI programs (and some other necessary bits and pieces that were missing: git, cmake, etc.), then the porting effort would have been much more straightforward and very familiar to DiRAC users. As it was, the setup required rather more system administration (or devops) skills than the average DiRAC user has or would be comfortable with. A small change to the HPC images provided (or customisation for a user base such as DiRAC by an associated RSE team) could have a large impact on the usability of the OCI for DiRAC HPC users.

Performance

We are not going to present the full set of performance results here (that will be the subject of a future publication), but we will present selected results indicative of the range of performance seen. We also do not yet have results for all the benchmarks on all of the DiRAC systems (we are working on this too!).

We are also still working on profiling these benchmark cases to understand the performance differences in more detail and will publish this in due course. The table below summarises the architectures of the different HPC systems used in this benchmarking exercise.

System | Processors | Memory | Interconnect | Notes
Extreme Scaling (Tesseract), Edinburgh | Intel Xeon 4116 (Skylake Silver), 2.2 GHz, 12c | 96 GB DDR4-2400 | Dual rail Intel OPA | Optimised for interconnect performance
Memory Intensive (COSMA7), Durham | Intel Xeon 5120 (Skylake Gold), 2.2 GHz, 14c | 512 GB DDR4-2400 (only 4 memory banks populated) | Mellanox EDR | Optimised for memory capacity
Data Intensive (Peta4-Skylake), Cambridge | Intel Xeon 6142 (Skylake Gold), 2.6 GHz, 16c | 384 GB DDR4-2666 | Single rail Intel OPA | General-purpose HPC
Data Intensive (DIaL), Leicester | Intel Xeon 6140 (Skylake Gold), 2.3 GHz, 18c | 192 GB DDR4-2666 | Mellanox EDR | General-purpose HPC
OCI | Intel Xeon 6154 (Skylake Gold), 3.0 GHz, 18c | 384 GB DDR4-2666 | RoCE (100 Gbps) | -

First we look at AREPO, the standard astrophysics application used by many different groups within DiRAC. The plot below shows the performance and scaling of the AREPO strong scaling benchmark on OCI compared to two of the DiRAC HPC systems (higher numbers are better performance) as a function of node count.

[Figure: AREPO strong scaling benchmark performance as a function of node count, OCI vs two DiRAC systems]

Here the performance of OCI is similar to or better than the on-premises HPC systems, showing that both the OCI compute nodes and the RoCE interconnect are more than adequate for this benchmark at these sizes.
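
For readers less used to these plots, scaling results of this kind are usually summarised as speedup and parallel efficiency relative to the smallest run; a minimal sketch with made-up timings (not measurements from this study):

    # Minimal sketch of how strong-scaling speedup and parallel efficiency are
    # derived from benchmark wall times. The timings are invented for illustration
    # and are not measurements from this study.
    timings = {1: 1000.0, 2: 520.0, 4: 270.0, 8: 145.0}  # nodes -> wall time in seconds (made up)

    base_nodes = min(timings)
    base_time = timings[base_nodes]

    for nodes in sorted(timings):
        speedup = base_time / timings[nodes]          # relative to the smallest run
        efficiency = speedup / (nodes / base_nodes)   # 1.0 would be perfect strong scaling
        print(f"{nodes:2d} nodes: speedup {speedup:5.2f}, efficiency {efficiency:4.2f}")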

Next we show the performance of Grid, a data-parallel C++ library aimed at quantum chromodynamics (QCD) modelling, and how it varies with node count. This is a weak-scaling benchmark and its performance depends on having the right balance between floating-point performance and interconnect performance.

[Figure: Grid weak scaling benchmark performance as a function of node count]

The OCI does not perform as well for this benchmark, with the on-premises DiRAC systems significantly outperforming OCI in this instance. This is likely due to the increased latency of the OCI interconnect compared to the Infiniband interconnect. It is also worth noting that DiRAC researchers typically run their Grid production jobs on 512 nodes of Tesseract (12,288 cores), which is currently well beyond the capability that OCI can provide.

Next we look at the performance of the TROVE strong scaling benchmark. TROVE is used to generate rovibrational energies and spectra of general polyatomic molecules, providing comprehensive data on molecular opacities that are important for modelling the atmospheres of exoplanets and cool stars. The plot below shows TROVE performance and scaling as a function of the number of nodes.

[Figure: TROVE strong scaling benchmark performance as a function of node count]

As you can see, at low node counts the OCI performance is similar to the best of the on-premises HPC systems. As the node count increases, however, the scaling becomes poor compared to the DIaL system. The COSMA7 on-premises system shows poor performance that is not currently well understood; work is ongoing to investigate this issue. More analysis is needed to understand the differences in scaling behaviour on the different systems.

The key message across all of the benchmark results shown here (and it holds for the other benchmark applications we have not shown) is that, for these node counts, the OCI performance is in the same regime as on-premises Infiniband-based HPC systems, and so OCI provides a viable platform for multi-node HPC using DiRAC research applications.

Finally, we take a brief look at parallel I/O performance in OCI. This was undertaken with a particular focus on the SWIFT application benchmark, where parallel I/O is a key part of how the application is used (it is also key to other benchmark applications). In OCI, the parallel file system was BeeGFS built on top of the local NVMe storage, with communication over the RoCE interconnect. This was compared to Lustre built on top of NVMe storage on the COSMA7 system. SWIFT has two key large file types, snapshot and restart files; in the benchmark run, the restart file was 994 GB and the snapshot files were 370 GB. The performance comparison is shown in the table below:

System | Snapshot write time (BW) | Restart write time (BW)
COSMA7 (16 nodes) | 53s (7.0 GB/s) | 34s (29.2 GB/s)
Oracle Cloud (16 nodes) | 59s (6.2 GB/s) | 14s (70 GB/s)

You can see that the parallel file system on OCI performs similarly to the on-premises system for the smaller snapshot files and shows substantially better performance for the larger restart files.
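
The bandwidth figures in the table follow directly from the file sizes and write times quoted above (bandwidth is simply file size divided by write time); a quick check using the sizes from the text:

    # Quick check that the bandwidths in the table follow from the quoted
    # file sizes and write times (bandwidth = file size / write time).
    snapshot_gb, restart_gb = 370, 994  # file sizes quoted in the text

    runs = {
        "COSMA7 snapshot":       (snapshot_gb, 53),  # ~7.0 GB/s
        "COSMA7 restart":        (restart_gb, 34),   # ~29.2 GB/s
        "Oracle Cloud snapshot": (snapshot_gb, 59),  # ~6.3 GB/s (reported as 6.2 GB/s)
        "Oracle Cloud restart":  (restart_gb, 14),   # ~71 GB/s (reported as 70 GB/s)
    }

    for label, (size_gb, seconds) in runs.items():
        print(f"{label}: {size_gb / seconds:.1f} GB/s")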

Conclusions and next steps

This has been a really interesting project to work on and I think both DiRAC and Oracle have learnt a lot from the positive and constructive interactions throughout the work. This interaction has been one of the most important aspects of the project: both partners now have a much better understanding of where their expectations and skills differ.

In terms of porting and usability, the current setup would be difficult for a standard DiRAC user (if such a thing exists) to use without support from someone with devops experience. There are projects underway to make this step easier (see, for example, Cluster in the Cloud: https://cluster-in-the-cloud.readthedocs.io/), and a small amount of effort to provide more fully-featured HPC images would go a long way towards addressing these concerns.

Performance compares favourably to the DiRAC on-premises systems, with the OCI cluster performing in the range of "just another HPC system" for most benchmarks at the node counts considered in this study, for both compute and parallel I/O. There are some concerns over the availability of large numbers of nodes with the RoCE interconnect in OCI, as we had some trouble getting access to a 32-node cluster.

In summary, from a technical standpoint, multi-node DiRAC HPC applications can definitely achieve reasonable performance in OCI, similar to traditional on-premises HPC systems, but the usability is not yet at a level that a typical DiRAC user could manage without some specialist help. Next, on to thinking about cost...