Exploring Fujitsu’s A64FX CPU

2 December 2020

The release of Fujitsu’s A64FX CPU has been a high point in an otherwise disappointing year. This next-generation CPU is the brain in Fugaku, the supercomputer at RIKEN in Japan, which was number one in the June 2020 TOP500 list.

Since February, Fujitsu has given EPCC access to a development A64FX machine as part of an early-access programme. We have been exploring the performance of this technology applied to numerous HPC workloads.

The A64FX is very special. Not only is it a high-performance ARM-based CPU with 48 physical cores, but it also incorporates several technology advances, such as ARM's Scalable Vector Extension (SVE) and HBM2 high-bandwidth memory.

The high floating-point and memory performance could be revolutionary for HPC, delivering performance at or near that of a GPU while remaining within a familiar CPU environment and without the need to rewrite codes.
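To make "without the need to rewrite codes" concrete, here is a minimal sketch of the kind of loop that benefits. It is a STREAM-style triad written in plain C with no intrinsics, and it is an illustration only, not one of the codes we benchmarked. The function names are hypothetical. SVE is vector-length agnostic, so a compiler targeting it can in principle vectorise a loop like this without any source changes. Whether a particular compiler does so depends on the compiler and options used, which are not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* STREAM-style triad: a[i] = b[i] + scalar * c[i].
 * Plain C with no intrinsics. `restrict` promises the compiler that the
 * arrays do not overlap, which makes the loop easier to auto-vectorise. */
void triad(double *restrict a, const double *restrict b,
           const double *restrict c, double scalar, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i) {
        a[i] = b[i] + scalar * c[i];
    }
}

/* Small self-check (hypothetical helper): returns 1 if triad()
 * produces the expected values on a tiny input, 0 otherwise. */
int triad_self_check(void)
{
    enum { N = 8 };
    double a[N], b[N], c[N];
    for (size_t i = 0; i < N; ++i) {
        b[i] = (double)i;
        c[i] = 2.0;
    }
    triad(a, b, c, 3.0, N);
    for (size_t i = 0; i < N; ++i) {
        if (a[i] != (double)i + 6.0) {
            return 0;
        }
    }
    return 1;
}
```

A kernel like this does very little arithmetic per byte moved, so its speed is usually limited by memory bandwidth rather than by floating-point throughput. That is the kind of code where high-bandwidth memory such as HBM2 would be expected to help most.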

Furthermore, Fugaku also contains Fujitsu's proprietary Torus Fusion (Tofu) interconnect, which promises very low-latency, high-bandwidth communication and could significantly improve node-to-node communication performance. The early-access machine we used contained 48 nodes connected by this same Tofu interconnect, and felt like a mini Fugaku. It used Fujitsu’s specialist compilers and, apart from having to make sure we were reading the English rather than the Japanese version of the manuals, was a really pleasant environment to use.

Benchmarks

To explore the potential of the machine we focused on a variety of benchmarks, mini-apps, and full applications, including HPCG, Nekbone, Minikab, COSA, OpenSBLI, and CASTEP. These were run on the A64FX early-access machine and then compared against some of our own machines at EPCC, including our ARM machine with Marvell ThunderX2 CPUs, and NextGenIO with Xeon Platinum Cascade Lake CPUs. Generally speaking, performance was very good, and for some of our codes the A64FX performed extremely well against the other existing CPU technologies.

However, performance characteristics were not the same across all the codes we tested on the A64FX processor, with a number of benchmarks performing slightly worse than on the comparison systems. To some extent, this is the interesting part: identifying what does not run quite so well, and why, is the point of the early-access programme.

It is important to highlight that, to date, we have focused on compiling and running the codes as they are, using the provided compilers and associated libraries, rather than on any architecture-specific optimisations. As such, there is likely to be significant opportunity to tune applications further for the A64FX.

Having played with the A64FX all year, we think this is an impressive piece of technology. The fact that it is so easy to get applications running on it, without requiring any specific code-level changes, demonstrates the high level of readiness of the overall system platform surrounding the CPU. 

With Fujitsu making the A64FX more widely available and Cray supporting it in their platforms, this product and associated technologies such as SVE have an important future in enabling the next generation of scientific workloads.