First phase of ARCHER2 installed at Advanced Computing Facility
Posted: 15 Jul 2020 | 15:41
ARCHER2, the new UK national supercomputing service, is a world-class advanced computing resource for UK researchers. The service is due to commence later in 2020, replacing the current ARCHER service.
The four-cabinet Shasta Mountain system completed its journey from Cray’s Chippewa Falls factory in the US to EPCC’s Advanced Computing Facility in July. This is the first phase of the 23-cabinet ARCHER2 system, the UK’s next national supercomputing service.
Moving these specialist systems and getting the right people here to install them is a logistical challenge at the best of times, but with the COVID-19 restrictions this was considerably more challenging than usual.
We are grateful to our colleagues at Cray/HPE for all their planning and perseverance! It is a huge step forward to see these systems on site.
Related to this, the ARCHER2 Test and Development System (TDS) is operational in Wisconsin, with the ARCHER2 team accessing it remotely. This has allowed testing and preparation work to get underway on user documentation, training courses and support for user application codes.
ARCHER2 will be a Cray Shasta system with an estimated peak performance of 28 PFLOP/s. The machine will have 5,848 compute nodes, each with dual AMD EPYC Zen2 (Rome) 64-core CPUs at 2.2 GHz, giving 748,544 cores in total and 1.57 PB of total system memory.
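The headline core count follows directly from the node and CPU figures quoted above; a quick arithmetic sketch:

```python
# Core-count arithmetic for the full ARCHER2 system, using the figures
# quoted above: 5,848 compute nodes, each with two 64-core AMD EPYC
# Zen2 (Rome) CPUs.
nodes = 5848
cpus_per_node = 2
cores_per_cpu = 64

cores_per_node = cpus_per_node * cores_per_cpu  # 128 cores per node
total_cores = nodes * cores_per_node

print(total_cores)  # 748544, matching the quoted total
```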
ARCHER2 should be capable, on average, of over eleven times the science throughput of ARCHER, based on benchmarks using five of the most heavily used codes on the current service.
ARCHER2 is provided by UKRI, EPCC, Cray (an HPE company) and the University of Edinburgh. It is hosted and managed by EPCC at the Advanced Computing Facility.
Gregor Muir, a summer intern at EPCC’s Advanced Computing Facility, describes the week-long installation of the first phase of ARCHER2.
Months of planning and work went into preparing for this week, including running in large power supplies, installing cooling pipework, strengthening the flooring to support the 3,500 kg cabinets, and vacuuming the computer room.
We were greeted by four enormous trucks and the Cray team who had helped ship and pack the ARCHER2 system. By 2pm everything was unpacked and sitting in the right place, which was impressive, considering the size of the cabinets.
Power connections were quickly made, but we had to wait for the arrival of the de-ionised water used in the system’s internal cooling before the system could be switched on. However, the CDU (cooling distribution unit) was powered up.
The CDU can be thought of as two halves. One is attached to the site’s water supply, which arrives chilled at 16°C. This water passes through a heat exchanger, effectively two interlinked radiators with entirely separate water circuits. The other half of the CDU pumps the system’s internal coolant around and through the heat exchanger, maintaining a specified temperature without ever mixing its own circuit with the site’s. This separation provides robustness and keeps the site supply away from the more expensive treated water needed to cool the supercomputer.
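The two-loop arrangement can be illustrated with a steady-state energy balance: the heat the internal loop collects from the blades must equal the heat the site's chilled-water loop carries away through the exchanger. The sketch below uses purely illustrative heat-load and flow-rate figures (not ARCHER2's actual values); only the 16°C site supply temperature comes from the text.

```python
# Minimal steady-state energy balance across a liquid-to-liquid heat
# exchanger like the one inside the CDU. All loads and flow rates are
# assumed, illustrative numbers.

CP_WATER = 4186.0  # specific heat of water, J/(kg*K)

def loop_delta_t(heat_load_w: float, flow_kg_per_s: float) -> float:
    """Temperature change across a water loop moving heat_load_w watts."""
    return heat_load_w / (flow_kg_per_s * CP_WATER)

heat_load = 250_000.0   # hypothetical 250 kW cabinet heat load
secondary_flow = 10.0   # kg/s in the internal de-ionised loop (assumed)
primary_flow = 12.0     # kg/s of site chilled water (assumed)

# The internal loop warms by this much passing over the blades...
dt_secondary = loop_delta_t(heat_load, secondary_flow)
# ...and the site water warms by the same total heat, starting from its
# 16 degree supply temperature (from the article).
dt_primary = loop_delta_t(heat_load, primary_flow)

print(f"internal loop temperature rise: {dt_secondary:.1f} K")
print(f"site water leaves at about: {16.0 + dt_primary:.1f} C")
```

Because only heat, never water, crosses the exchanger, a leak or contamination in one circuit cannot reach the other, which is the robustness the paragraph above describes.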
While the CDU was being powered up, the site’s water supply was connected and proved to be working perfectly.
The CDU work continued, with the de-ionised, additive-enhanced water arriving. There was also lots of configuring and some troubleshooting. By the end of the day, management and storage were sorted, and the CDU was operating on the site’s water supply and circulating the system’s supply.
The last day on site for the installation teams, which meant all troubleshooting had to be finished. I was lucky enough to watch the diagnosing and repairing of blades. The first was done in less than two minutes, just a direct swap of two DIMMs with fresh spares. The workbench trolley that the blade was sitting on was hinged and, with the aid of a pneumatic arm, the blade was lifted from horizontal to vertical, ensuring ease of access for the blade lifter. The lifter itself was pre-programmed with the height of each row of blades, and magnetic sensors ensured perfect alignment. Everything was beautifully engineered.
Load testing has begun. There will now be a period of training before the system is made available to users, and it will operate concurrently with ARCHER for a few months before the full 23-cabinet system is delivered and built.
For further information and photographs, see:
Short video showing the installation: https://bit.ly/2XeApIi
Lorna Smith, EPCC
Gregor Muir, Summer intern, Advanced Computing Facility