ARCHER Phase 2 User Meeting
Posted: 16 Jul 2014 | 18:47
Like most major HPC services, ARCHER will undergo a hardware upgrade during its operation to ensure that the available computational resources remain sufficient for UK computational simulation requirements for the duration of the service. The previous UK national service, HECToR, had two major upgrades, which took the system from 11,328 cores (and a peak performance of 59 TFlop/s) when it first started to 90,112 cores (and a peak performance of over 800 TFlop/s) for the last few years of the service.
As ARCHER has a slightly shorter duration than HECToR (ARCHER has a planned lifetime of 5 years), there will be only one upgrade to the service, probably at some point in 2015. EPSRC has allocated up to £12m for the upgrade, but the exact nature of the upgrade hardware has not yet been decided, hence this user meeting and a user questionnaire circulated to all ARCHER users.
Possible upgrade paths
At the meeting, Cray, the hardware provider of the current ARCHER system, outlined four possible upgrade paths:
1. Upgrade by simply adding more nodes of the type used in the current ARCHER system (ARCHER is composed of nodes with 2 x 12-core Intel Xeon "Ivy Bridge" processors and 64 or 128 GB of memory per node). It should be possible to increase the computational throughput of ARCHER by approximately 50% with this option, and this upgrade could be provided as early as the first part of 2015.
2. Upgrade mainly with the current ARCHER hardware, but also purchase some nodes with accelerators in them. If the upgrade were to happen in the first part of 2015, the available accelerators would be NVIDIA K40 GPUs (each accelerator node would have 1 x 10-core Intel processor and 1 x NVIDIA K40 GPU).
3. Upgrade by adding nodes with the new Intel processor (Haswell) rather than the processors ARCHER currently has. Haswell supports faster memory (DDR4 at 2133 MHz) than the current processors, and has double the L1 and L2 cache bandwidth. It also has slightly improved floating point performance. Because of these hardware improvements, it should be possible to increase the computational capacity of ARCHER by around 60% or more using these types of nodes (as opposed to simply upgrading with nodes using the existing processors). However, using these processors adds some complexity: to get the best performance, applications have to be compiled specifically for them, and the resulting executables will not run on the existing processors. This would mean users would have to re-compile their codes for the different parts of ARCHER and potentially select which types of nodes they would like to run on.
4. Delay the upgrade until 2016 to enable upgrading with next-generation technology. This would potentially allow upgrading with even newer Intel processors, or the next generation of accelerators (GPUs or Intel Xeon Phis), but would mean the service ran in its upgraded form for a shorter period of time (upgrading at the beginning of 2016 would give only 18 months of use before ARCHER officially finishes).
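The re-compilation issue in option 3 can be pictured as keeping one executable per processor type and choosing between them when a job starts. The following is a minimal sketch, not an ARCHER tool. It assumes a Linux node that lists CPU features in /proc/cpuinfo, it uses AVX2 support (present on Haswell, absent on Ivy Bridge) as the distinguishing feature, and the executable names are hypothetical.

```python
from pathlib import Path

# Hypothetical executable names, one build per processor type.
EXECUTABLES = {"haswell": "my_app.haswell", "ivybridge": "my_app.ivybridge"}


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def pick_executable(flags):
    """Pick the Haswell build only if the node reports AVX2 support."""
    # Assumption: AVX2 is used as a proxy for "Haswell or newer".
    return EXECUTABLES["haswell" if "avx2" in flags else "ivybridge"]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    # Fall back to the Ivy Bridge build if the file is not available.
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    print(pick_executable(cpu_flags(text)))
```

Defaulting to the Ivy Bridge build is the conservative choice here, since an executable built for the older processor would be expected to run on both node types, only without the Haswell-specific optimisations.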
The meeting attendees then went on to discuss, in groups, some potential scenarios based on these options. The scenarios were:
1. Based around upgrade option 1, upgrade simply with Ivy Bridge nodes, but retain £1m of the upgrade money for something else.
2. Based around upgrade option 2, upgrade primarily with Ivy Bridge nodes but also with a significant portion of NVIDIA K40 GPU-based nodes.
3. Based around upgrade option 3, upgrade with Haswell processors (the next-generation Intel processors) rather than Ivy Bridge.
4. Based around upgrade option 4, upgrade with some Ivy Bridge nodes now but keep a significant portion of the money for a later upgrade with next-generation technology (what this would be wasn't specified).
In the group discussions, scenarios 1 and 4 came out as being very similar, with the £1m left in scenario 1 being used to buy an interesting accelerator (or other technology) system to sit alongside ARCHER but run as a separate service. Users would have access to the data on ARCHER's main file systems and the RDF, but the system would be separate from the main machine. Scenario 4 was generally discussed in similar terms, but with a bigger portion of the money spent on the accelerator system and spent later, thereby getting access to newer technology.
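The trade-off behind option 4 and scenario 4 (newer technology, but fewer months of use) can be made concrete with some rough arithmetic. The sketch below uses the figures quoted above: roughly 50% more throughput for option 1, around 60% for option 3, and 18 months of use for an upgrade at the start of 2016. Everything else is an illustrative assumption: that an upgrade in early 2015 would be in service for about 30 months, that option 3 would be available at the same time as option 1, and that the whole budget delivers a constant throughput gain.

```python
def extra_capacity(gain, months):
    """Extra work delivered, in units of 'months of the current ARCHER'."""
    return gain * months


def break_even_gain(reference_gain, reference_months, later_months):
    """Gain a later upgrade needs to deliver as much extra work as an earlier one."""
    return reference_gain * reference_months / later_months


# Figures from the text: +50% (option 1), +60% (option 3), 18 months for a 2016 upgrade.
# Assumption: an upgrade in early 2015 would be in service for about 30 months.
EARLY_MONTHS = 30
LATE_MONTHS = 18

option1 = extra_capacity(0.50, EARLY_MONTHS)
option3 = extra_capacity(0.60, EARLY_MONTHS)
needed = break_even_gain(0.50, EARLY_MONTHS, LATE_MONTHS)

print(f"Option 1: {option1:.1f} extra machine-months")
print(f"Option 3: {option3:.1f} extra machine-months")
print(f"Option 4 break-even gain vs option 1: {needed:.0%}")
```

Under these assumptions, an upgrade delayed to 2016 would need to raise throughput by roughly 83% to deliver as much extra work over the remaining service as option 1. This ignores any benefit from new capabilities that next-generation hardware might bring.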