Global or local - which is best?

Author: Adrian Jackson

Posted: 9 Oct 2019 | 17:30

Selfish performance

Sharing of resources creates challenges for the performance and scaling of large parallel applications. In the NEXTGenIO project we have been focusing specifically on I/O and data management/storage costs, working from the realisation that current filesystems will struggle to efficiently load and store data for millions of processes or tasks all requesting different data sets or pieces of information.

An example of the challenges at current scales is shown in Figure 1, where we timed I/O from a benchmark application on ARCHER across a number of runs at different core counts (weak scaling) to see the variation in I/O cost.

Figure 1: I/O performance variability across a range of process counts and repetitions.

We can see that there can easily be a factor of 10 difference between the performance of the fastest and slowest runs across all the process counts used. The application is performing the same I/O in each case, so the variation experienced must be associated with the shared nature of the parallel filesystem being used (and the network connecting the compute nodes to that filesystem). Contention for shared resources in a system can significantly impact application performance, or limit application functionality (i.e. force applications to undertake less I/O than is desirable in order to mitigate these costs).

One approach to mitigating such problems is to provide higher-performance resources to match application demands, such as burst buffers. These are filesystems using faster hardware and configurations that allow higher application performance, but are too expensive to provide the whole storage requirement of a system. However, they are still external, shared resources, and so cannot entirely remove the costs or drawbacks of sharing.

Figure 2: External filesystems and burst buffer resources in HPC systems

Our approach is to investigate whether the performance benefits of creating a system with storage resources within compute nodes would be sufficiently large to outweigh the functionality issues associated with having local storage. This research has been enabled by the new generation of non-volatile, persistent storage/memory technology that has recently made it to market, particularly Intel's Optane DCPMM. This byte-addressable persistent memory (B-APM) has the potential to provide very high I/O bandwidth and low latency inside individual compute nodes, as well as enable applications to move from an I/O focus to a memory focus.

The NEXTGenIO prototype system has enabled us to start investigating some of the performance and functionality questions associated with this new memory technology and its inclusion directly in compute nodes. We have been undertaking a wide range of different performance experiments, from looking at the large memory space functionality that this B-APM provides, to direct access from applications using B-APM as memory, to applications and benchmarks treating the hardware as if it were a storage device and writing files directly to the memory in compute nodes.

One of the recent benchmarks we have been running is IOR, the parallel I/O benchmark originally developed at Lawrence Livermore National Laboratory. IOR has many different benchmarking modes, but for this test we were focusing on what is often called "IOR Easy", benchmarking parallel I/O where each process creates and uses its own file.

The prototype compute nodes have two 24-core processors, each with 96GB of DRAM and 1.5TB of Optane DCPMM (giving 192GB of DRAM and 3TB of DCPMM per node). This let us run 48 MPI processes per compute node, each with its own file stored on filesystems mounted on the B-APM. We had to modify IOR slightly to enable it to exploit the in-node filesystems, as otherwise performance suffered from non-uniform memory access (NUMA) effects (there will be more on this in a later blog post). Apart from that, however, it was a standard IOR run.
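To give a flavour of what "IOR Easy" measures, here is a minimal sketch of the file-per-process pattern in MPI, with each rank writing its own file to an fsdax mount local to its socket. The mount points /mnt/pmem0 and /mnt/pmem1, the block size, and the rank-to-socket mapping are illustrative assumptions, not the actual IOR modification we used.

```c
/* Minimal file-per-process write sketch in the spirit of "IOR Easy".
 * Assumptions (not the actual IOR modification): two fsdax mounts,
 * /mnt/pmem0 and /mnt/pmem1, one per socket, and ranks placed on
 * sockets in consecutive blocks of 24. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE (64UL * 1024 * 1024)   /* 64 MiB per rank, illustrative */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, local_rank;
    MPI_Comm node_comm;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Group the ranks sharing a node so we can work out which socket we sit on. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);

    /* Pick the fsdax filesystem local to this rank's socket (24 cores each). */
    const char *mount = (local_rank < 24) ? "/mnt/pmem0" : "/mnt/pmem1";

    char path[256];
    snprintf(path, sizeof(path), "%s/ioreasy_rank%06d.dat", mount, rank);

    char *buf = malloc(BLOCK_SIZE);
    memset(buf, (char)rank, BLOCK_SIZE);

    /* Each process creates and writes its own file: no sharing between ranks. */
    FILE *fp = fopen(path, "wb");
    if (!fp) { perror("fopen"); MPI_Abort(MPI_COMM_WORLD, 1); }
    fwrite(buf, 1, BLOCK_SIZE, fp);
    fclose(fp);

    free(buf);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```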

Figure 3 shows that the in-node B-APM can give extremely high performance, with over 1.8TB/s aggregate read bandwidth on 34 nodes. It is also evident from the graph that there is an asymmetry between read and write performance with this hardware. The write performance on 30 nodes is around 300GB/s, approximately 6x lower than the read bandwidth, but this is expected from the hardware (Optane DCPMM has a longer write cycle than read cycle because of the nature of the memory).

Figure 3: IOR Easy results using fsdax filesystem on the NEXTGenIO prototype

To put this in context, the top system on the current IO-500 benchmark list has an IOR Easy read bandwidth of 521GB/s and a write bandwidth of 336GB/s using 512 compute nodes. The NEXTGenIO prototype can achieve that read bandwidth with 12 compute nodes, but does not quite reach that write bandwidth with 34 compute nodes using the benchmarking configuration described here (we would require around 40 compute nodes to achieve that write bandwidth when using 48 writing processes per node).

However, if we reduce the number of processes running on a node from 48 to 24 and run the benchmark again, we can improve the achievable write performance (albeit at the expense of reducing the read performance). Figure 4 presents the read and write performance of the benchmark when underpopulating nodes, and we see around 400GB/s write performance and 1.5TB/s read performance in this configuration, easily exceeding both the write and read performance of the top entry on the IO-500 list.

Figure 4: IOR Easy results using fsdax filesystem on the NEXTGenIO prototype with 24 processes per node

This demonstrates that the hardware has significant potential to provide very high I/O performance, even when using traditional file interfaces. What's more, we could have run multiple IOR benchmarks at the same time, each using a different set of nodes, and seen no difference in the performance achieved by each individual benchmark run. Five concurrent six-node benchmarks would achieve the same aggregate performance as a single thirty-node benchmark, which isn't generally true on traditional parallel filesystems.

However, the real potential of this form of Optane memory is not hosting files, but rather enabling memory/I/O access without going through the traditional file interface, which requires operating system involvement and data to be written, or read, in blocks. An application that can be re-engineered to exploit that functionality can get closer to the benchmarked limits of this memory, namely around 100GB/s read bandwidth and 20GB/s write bandwidth per node for the setup we are using on the NEXTGenIO prototype.
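As a rough illustration of what that looks like in code, the sketch below uses libpmem from PMDK to map a file on an fsdax filesystem into the application's address space and update it with ordinary stores followed by an explicit persist, rather than block-based read/write calls. The mount point, file name and region size are assumptions for illustration; this is not taken from the NEXTGenIO benchmarks.

```c
/* Minimal sketch of byte-addressable access to persistent memory with
 * libpmem (PMDK), bypassing block-based read()/write(). The mount point
 * /mnt/pmem0 and the file name are assumptions for illustration. */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

#define POOL_SIZE (64UL * 1024 * 1024)   /* 64 MiB region, illustrative */

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Map a file on the fsdax filesystem directly into our address space. */
    char *pmem = pmem_map_file("/mnt/pmem0/example.pool", POOL_SIZE,
                               PMEM_FILE_CREATE, 0666, &mapped_len, &is_pmem);
    if (pmem == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    /* Update the region with ordinary stores: no system call, no blocks. */
    const char *msg = "persistent hello";
    strcpy(pmem, msg);

    /* Ensure the stores reach persistence (cache flush plus fence), falling
     * back to msync if the mapping is not actually persistent memory. */
    if (is_pmem)
        pmem_persist(pmem, strlen(msg) + 1);
    else
        pmem_msync(pmem, strlen(msg) + 1);

    pmem_unmap(pmem, mapped_len);
    return 0;
}
```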

How to exploit the memory without writing data to files, what impact NUMA effects have on performance, and how to use Optane DCPMM for very large, volatile memory spaces will all be covered in future blog posts, so keep an eye out for these over the next few weeks. If you're going to Supercomputing 2019 and fancy learning how to program persistent memory, as well as getting hands-on with the NEXTGenIO prototype, check out our half-day tutorial on Practical Persistent Memory Programming.

Author

Adrian Jackson, EPCC

Adrian on Twitter: @adrianjhpc