View from the storage side
Posted: 16 Aug 2017 | 15:59
I recently attended the 2017 Flash Memory Summit, a conference aimed primarily at storage technology. It was originally based around flash memory, although it has expanded to cover all forms of non-volatile storage technology.
Non-volatile memory is a big deal nowadays. It is memory that retains data even when it has no power (unlike the volatile memory in computers, which loses data when the power is switched off). Flash memory is a particular form of non-volatile memory; it has been in use for a long time and has had a massive impact on consumer technology, from the storage in your cameras and phones to the SSDs routinely installed in laptop and desktop systems.
NEXTGenIO and 3D XPoint™
I was there because of our involvement in the (European-funded) NEXTGenIO project, where we are collaborating on designing and building a prototype system using a new form of non-volatile technology: Intel and Micron's 3D XPoint™ memory. The project involves the creation of a prototype system (mainly undertaken by Fujitsu and Intel) and also the design and development of all the software and systemware required to enable applications to use the new hardware.
The 3D XPoint™ hardware is non-volatile memory in DIMM form. This means it can be connected directly to the memory bus, in the same way as the normal main memory (DRAM, or DDR as it's often known) used in current systems. However, it is likely to have much higher storage capacity than standard memory, and it retains data when the power is off, ready for re-use once the machine is started back up.
NVDIMMs, NVRAM, and SCM
Whilst larger in capacity than DRAM, it is likely to have higher latencies and (as it sits on the same bus as the DRAM) lower bandwidths than DRAM: probably around 5-10x slower than DRAM, but much faster than current non-volatile storage (i.e. SSDs or hard disk drives (HDDs)). This type of memory, of which 3D XPoint™ is not the only instance (others include ReRAM and supercapacitor-backed DRAM), is often known as NVRAM or NVDIMMs, because it uses the same form factor and interface as standard RAM DIMMs (i.e. main memory).
However, there is also another name becoming common for such memory: storage class memory (SCM). SCM was a big topic at the Flash Memory Summit, from presentations on the different types of SCM that are available or soon to be on the market, to strategies for creating storage systems from such memory, programming models for applications to exploit them directly, and filesystems designed to run on them.
One of the major benefits of SCM is that it can be accessed in a different way from current storage devices. Current disks, be they SSDs or HDDs, are generally accessed in block mode, i.e. data is written or read in chunks, often 4KB at a time. Block-mode access generally goes through the operating system and storage drivers, so there is a significant latency to access an individual block (especially compared to memory accesses).
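As an illustrative sketch (ordinary file I/O in Python on a scratch file, not a real block-device driver), block-mode access means transferring a whole fixed-size chunk even when you only care about a single byte:

```python
import os
import tempfile

BLOCK_SIZE = 4096  # a common block size, as mentioned above

# A scratch file standing in for a block device.
path = os.path.join(tempfile.mkdtemp(), "blockdev.img")

# Writing: changing one byte still means writing a whole block.
block = bytearray(BLOCK_SIZE)
block[0] = 0x42  # change a single byte...
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
os.pwrite(fd, bytes(block), 0)  # ...but transfer all 4 KB
os.close(fd)

# Reading: the whole block is fetched to inspect one byte.
fd = os.open(path, os.O_RDONLY)
data = os.pread(fd, BLOCK_SIZE, 0)
os.close(fd)
print(data[0])  # prints 66 (0x42)
```

A real device adds the operating system and driver path on top of each of these block transfers, which is where the latency mentioned above comes from.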
SCM can be accessed in block mode, but because it is connected directly to the memory controller in the system, it can also be accessed in direct access mode, also known as DAX mode. This enables individual bytes to be written and read in the same way as normal memory, although a few more instructions may be required to ensure the data is persisted (i.e. to take advantage of the non-volatile nature of the memory).
In a lot of ways this is similar to memory-mapped file functionality, although without the memory size restriction that implies: memory-mapped files cannot usefully be bigger than the DRAM you have in the system, whereas SCM in DAX mode does not consume DRAM and can be much larger than the available DRAM.
The increased performance of SCM over standard storage devices, and the ability to use DAX mode to access the hardware, means there is the potential for orders-of-magnitude improvement in performance for I/O operations (reading and writing data to persistent storage). However, it does require applications to be re-written to exploit this functionality: traditional file-based I/O needs to be changed to DAX operations.
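The memory-mapped-file analogy can be sketched in Python. This is only an analogy: a real DAX mapping bypasses the page cache, and persisting data would use CPU cache-flush instructions (or a library such as Intel's PMDK) rather than an msync-style flush. But the programming model (load/store to individual bytes, followed by an explicit persist step) looks similar:

```python
import mmap
import os
import tempfile

# A scratch file standing in for a persistent-memory region.
path = os.path.join(tempfile.mkdtemp(), "pmem.img")
with open(path, "wb") as f:
    f.truncate(4096)

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    mm[42] = 0x7F   # byte-granular store; no read/write syscall per access
    mm.flush()      # analogue of the extra 'persist' instructions
    mm.close()

# The store survives in the file, as it would in SCM after power-off.
with open(path, "rb") as f:
    assert f.read()[42] == 0x7F
```

Note how there is no block-sized transfer in the application's code: a single byte is stored directly, and only the persist step is explicit.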
Actually, you don't need SCM to get some of these benefits. People are already making and using devices that can be accessed in DAX mode: SSDs attached to PCIe interfaces. Also known as NVMe (NVM Express) devices, these enable non-volatile memory to be accessed as if it were memory.
Of course, as these devices are attached over the PCIe bus they won't have the performance that true SCM should provide, but they are still faster than standard I/O devices and can offer performance benefits whilst we wait for SCM to become commercially available. We recently benchmarked an Intel Optane-based NVMe device, and will be publishing the performance report shortly. Optane is the name Intel uses for SSDs built from 3D XPoint™ memory.
The storage industry is talking about convergence when SCM starts to be included in systems: convergence between memory and storage technologies. Whilst there is a lot of hype around this topic, and some speakers were even suggesting it will signal a move away from the von Neumann architecture that all our computers implement (I disagree with this assessment), it is clear that SCM is going to be a key part of future computing systems, provided the price is right!
How much SCM is going to cost is as yet unknown (at least to me). For it to be adopted it needs to be cheaper (probably significantly cheaper) than DRAM on a per-GB basis, as it's going to be larger and slower than DRAM. As it's a new class of memory, it's likely to be expensive when it first comes to market, and it'll take a few years for manufacturing costs to come down and for volumes to grow enough to reduce unit prices.
Aggregation vs Disaggregation
The other topic that was quite prevalent at the conference was that of aggregation vs disaggregation of storage. Aggregated storage is where storage is installed in, or attached to, individual servers (I/O nodes). Each server has its own processors and memory, and a set of disks/storage devices attached, as shown in Figure 1.
Figure 1: Aggregated storage
Disaggregated storage has I/O nodes and the storage they are responsible for separated, with I/O nodes connected to the storage by some form of network (PCIe switch, Ethernet, etc.), as shown in Figure 2.
This distinction is only really becoming relevant now because, until recently, it wasn't really possible to attach I/O devices outside servers, or at least not cheaply. Some level of disaggregated storage was possible with Fibre Channel technology, but this was expensive and did not allow full disaggregation (i.e. any I/O node talking to any disk on an equal footing).
The advent of fast, scalable networks that disks can be connected to, such as PCIe switches and other proprietary solutions, means that large-scale disk arrays can be constructed and attached to arbitrary numbers of I/O nodes. Indeed, at the conference Intel was demonstrating a 1PB set of NVMe disks in a 1U form factor ("ruler" disks) that follows the disaggregated structure.
The benefit that disaggregation can bring is the separation of storage devices from the compute that is managing them, meaning storage can be scaled without having to scale the I/O nodes, or vice versa, for the particular application(s) the storage is being deployed for. It also allows flexibility with connections between I/O nodes and storage, and the type of storage deployed in the system.
Figure 2: Disaggregated storage
However, coming from an HPC perspective, I was a little puzzled by this topic, as we have worked with disaggregated storage for as long as I can remember, i.e. a global filesystem available to all compute nodes. In a multi-user, multi-compute-node environment it's technically challenging not to have disaggregated storage, as you have to be able to handle file access from any node to any part of the filesystem.
Indeed, this is one of the difficult challenges that I see SCM bringing in the HPC context: it offers the potential for very high I/O performance, but in an aggregated form, i.e. with the SCM inside the compute nodes. How do you exploit that on a shared system with a job scheduler and many users? This is what we're looking at in NEXTGenIO.
However, whilst we do use disaggregated storage in HPC systems, at least from an individual user's perspective, we don't tend to use it from an I/O system perspective: systems like ARCHER (a Cray XC30) have a Lustre filesystem mounted on a set of I/O nodes, with each I/O node controlling a set of disks. If you want a larger filesystem, you need to install more I/O nodes.
So, in that sense, disaggregated storage would be a new trend for HPC systems as well, and does offer the potential for tuning I/O performance or resources for workloads. For me, the exciting prospect is some level of software defined storage, where a parallel job can have a set of I/O resources assigned to it, with different jobs using different amounts of resources, as required.
This would enable the storage used for parallel simulations, or data analytics applications, to be provisioned as required, with heavy I/O jobs taking large shares of I/O resources, and so on. This is some way off, but these are some of the issues we are looking at in NEXTGenIO (e.g. how do you enable such resource partitioning and tracking in job schedulers, and how do you enable DAX modes for parallel applications?).
I see exciting things ahead in I/O for HPC systems, although I think the key will be getting as much functionality as possible managed by the systemware, rather than by application developers themselves, to make SCM and configurable storage easy for applications to exploit.
3D flash
One of the biggest topics of the conference was 3D flash. This is a relatively new manufacturing method for creating much denser flash (traditional flash rather than SCM); some vendors are already shipping 3D flash chips, many more are planned, and new manufacturing plants are coming online.
3D flash provides the potential for much higher storage density, with Toshiba unveiling a 32 TB SSD disk utilising 3D flash at the show. I'd expect a lot more of this type of announcement in the near future, meaning significant increases in storage capacity of SSDs in the coming years.
There were also predictions that flash memory prices are likely to crash at some point in 2018. Discussions focussed on the premium that manufacturers are currently able to charge for flash memory, possibly because a large number of flash manufacturing plants (fabs) are in the process of being re-purposed to produce 3D flash.
The feeling was that when these fabs come online properly there will be significant manufacturing capacity and capability, which will drive down the price of flash devices, as the fabs will need to produce large volumes to be used efficiently. Sounds like it might be a good time to buy some large-scale, flash-based storage.
Thanks to EPCC's David Henty and Michele Weiland for corrections and suggestions.