HPC hardware in 2016 and beyond
Posted: 19 Apr 2016 | 23:14
Anyone taking more than a passing interest in HPC hardware recently will have noticed that there are a number of reasonably significant trends coming to fruition in 2016. Of particular interest to me are on-package memory, integrated functionality, and new processor competitors.
On-package memory, memory that is directly attached to the processor, has been promised for a number of years now. The first product of this type I can remember was Micron's Hybrid Memory Cube around 2010/2011, but it's taken a few years for the hardware to become mature enough (or technically feasible and cheap enough) to make it to mass market chips. We now have it in the form of MCDRAM for Intel's upcoming Xeon Phi processor (Knights Landing), and as HBM2 on Nvidia's recently announced P100 GPU.
The main advantage of on-package memory, or stacked RAM, is that it provides much higher bandwidth than traditional main memory. The memory chips are connected by high performance buses that can be accessed independently. Having the memory on the same package as the processor, with much shorter connections than traditional memory uses, also reduces the cost of accessing data, as less time and energy need to be spent sending the data between the processor and the memory.
However, on-package memory is not generally faster than the standard main memory we are currently using. That is, the latency (the time taken to access a single piece of data) is no lower for on-package memory than for external main memory. In fact, access can be slightly slower, as there may be extra functionality (logic circuits for the routing and selection of communication channels/data) involved in accessing this high bandwidth memory.
But for a large range of applications this does not matter as most scientific simulation applications tend to be memory bandwidth bound (performance dependent on how much data can be streamed to and from memory) rather than memory latency bound (performance dependent on how quickly an individual piece of data can be accessed), or even compute bound (performance limited by how quickly instructions can be executed by a processor).
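To make the distinction between bandwidth bound and compute bound a little more concrete, here is a minimal sketch of a roofline-style estimate in C. It is not taken from any vendor documentation: the function names are made up for illustration, and any peak or bandwidth figures you feed it are your own assumptions rather than measurements of a real chip.

```c
#include <stdio.h>

/* Hypothetical helper: attainable performance in GFLOP/s under a simple
   roofline-style estimate. The kernel can go no faster than the processor's
   peak rate, and no faster than the rate at which memory can feed it
   (bandwidth in GB/s multiplied by the flops performed per byte moved). */
double attainable_gflops(double peak_gflops, double bandwidth_gbs,
                         double flops_per_byte)
{
    double memory_limit = bandwidth_gbs * flops_per_byte;
    return memory_limit < peak_gflops ? memory_limit : peak_gflops;
}

/* Hypothetical helper: returns 1 if the memory limit is below the compute
   peak (bandwidth bound), 0 otherwise (compute bound). */
int is_bandwidth_bound(double peak_gflops, double bandwidth_gbs,
                       double flops_per_byte)
{
    return bandwidth_gbs * flops_per_byte < peak_gflops;
}

/* Print the estimate for one memory type; the label is free text. */
void report(const char *label, double peak_gflops, double bandwidth_gbs,
            double flops_per_byte)
{
    printf("%s: %.1f GFLOP/s attainable (%s bound)\n", label,
           attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte),
           is_bandwidth_bound(peak_gflops, bandwidth_gbs, flops_per_byte)
               ? "bandwidth" : "compute");
}
```

For example, with an assumed peak of 1000 GFLOP/s and a kernel doing 0.25 flops per byte, calling report() with 100 GB/s and then with 400 GB/s shows the estimate rising in line with the bandwidth, which is the situation where on-package memory would be expected to help. A kernel with a high enough flops-per-byte ratio hits the compute peak instead and gains nothing from the extra bandwidth.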
This is because the types of application we work with tend to exploit the spatial and temporal locality of data (re-use data often, or use data that is close in memory to other data you have recently used), which in general means performance is restricted by how much of that data can be loaded and stored when needed.
The other issue with on-package memory is that it is currently difficult to attach lots of it to the processor, as the heat generated by the processor and the memory becomes an issue. This means we don't currently see processors with large amounts of on-package memory (16 GB seems to be the limit at the moment), so a chunky simulation will still need to store data in main memory.
However, the two accelerators currently using on-package memory also provide direct access to the main memory from the accelerator. This means both the on-package and main memory can be used from the application, giving a very large potential memory space (depending on how much you want to spend on your main memory). To access all that memory, though, it's likely applications will need to be altered so that they can allocate data in both memory spaces.
It currently doesn't look like such memory systems are coming to standard CPUs any time soon, so altering applications for on-package memory will require maintaining a separate version of the application for normal processors, which is annoying. That said, the changes required for this type of memory usage don't look very complicated or invasive from what we've seen so far.
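As a rough idea of what such a change might look like, here is a minimal sketch in C of an allocation wrapper that can place data in either memory space. It assumes the memkind library's hbwmalloc interface (hbw_check_available, hbw_malloc, hbw_free) as one possible route to high bandwidth memory. The post itself doesn't name a specific API, so treat that choice, the USE_MEMKIND build flag, and the tiered_buffer type as illustrative assumptions. Built without the flag, the same source falls back to plain malloc, which is one way of avoiding a separate version for normal processors.

```c
#include <stdlib.h>

#ifdef USE_MEMKIND
#include <hbwmalloc.h>  /* memkind's high bandwidth memory interface */
#endif

/* Hypothetical wrapper type: records which allocator produced the pointer
   so that it can be released with the matching free call. */
typedef struct {
    void *ptr;
    int in_fast_memory;  /* 1 if allocated through hbw_malloc */
} tiered_buffer;

/* Allocate 'bytes' bytes, trying high bandwidth memory first when
   prefer_fast is non-zero and the build has memkind support. */
tiered_buffer tiered_alloc(size_t bytes, int prefer_fast)
{
    tiered_buffer buf = { NULL, 0 };
#ifdef USE_MEMKIND
    /* hbw_check_available() returns 0 when high bandwidth memory exists */
    if (prefer_fast && hbw_check_available() == 0) {
        buf.ptr = hbw_malloc(bytes);
        if (buf.ptr != NULL) {
            buf.in_fast_memory = 1;
            return buf;
        }
    }
#else
    (void)prefer_fast;  /* no fast memory in this build */
#endif
    buf.ptr = malloc(bytes);  /* ordinary main memory */
    return buf;
}

/* Release a buffer with the allocator that created it, then clear it. */
void tiered_free(tiered_buffer *buf)
{
#ifdef USE_MEMKIND
    if (buf->in_fast_memory) {
        hbw_free(buf->ptr);
        buf->ptr = NULL;
        buf->in_fast_memory = 0;
        return;
    }
#endif
    free(buf->ptr);
    buf->ptr = NULL;
    buf->in_fast_memory = 0;
}
```

The idea would be to request fast memory only for the arrays that are streamed through most heavily and leave everything else in main memory, given that the on-package capacity is small. Which arrays those are is application specific and would need profiling to decide.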
Integrated functionality on CPUs also seems to be on the way, with Intel planning on-processor network and on-processor FPGA functionality for future processors, and AMD talking about the development of an APU for HPC applications with integrated CPU and GPU components (as well as HBM).
Of course, on-package memory is really just another aspect of this integrated functionality, and bringing network or FPGAs on-processor has similar benefits: it removes the need for signals to travel further, over large, energy-intensive copper connections (i.e. communications don't need to go off the processor, across a PCI bus, to get to the FPGA or network adapter).
However, from a processor manufacturer's perspective it also has other benefits, such as ensuring you're buying all your technology from them (if the network and memory are integrated into the processor you're not buying them from anyone else).
I do wonder, though, whether Intel, AMD, and the other manufacturers have fully considered the risks associated with these integrated approaches. The benefits are clear: reduced energy consumption and higher performance are both great aims (as are locked-in purchasers). But it does mean that the devices they are manufacturing are increasingly complex.
In the past it was pretty disastrous if you found a problem with a component (think of the Pentium floating-point division bug), but it could be tackled by replacing that component. With integration, if your processor has a similar bug then you have to replace the processor, and you're throwing away the working memory, network, FPGA, etc. that have all been attached to that processor. It's a much bigger deal, but I guess the temptation is always to design for success rather than considering failure (much like in software).
However, this is a problem for Intel, AMD, Nvidia, etc., not for me, so I don't need to worry about it, and time will tell if people have properly considered the impact of component failure when designing such complex, inter-dependent systems.
The final trend that seems to have become more apparent recently is the increased competition in the processor space. We've been through a brief period, over the past 5 years, where there has really only been one processor manufacturer for chips used in HPC machines: Intel. Whilst a number of large machines have included Nvidia GPUs, they are still likely to have had Intel host processors with those accelerators. Indeed, since IBM stopped their BlueGene line and have not been pushing the Power processors, there really hasn't been a competitor for Intel in this space.
However, there are a number of processors that seem to be promising potential competition for Intel's next generation of server processors. IBM is back in the HPC business with Power9 processors that are adding the NVLink technology for high-speed communications with Nvidia GPUs. A number of manufacturers are also developing 64-bit server processors based on ARM designs, and AMD will be back with their APU processor. Whilst none of these are game changers for HPC, they do at least provide the potential for interesting hardware competition in the coming years.