Ensuring continuity of service at the ACF
15 November 2022
EPCC’s Advanced Computing Facility (ACF) delivers a world-class environment to support the many computing and data services which we provide. This article takes a behind-the-scenes look at some of the activities the ACF team undertakes to provide the stable services our users expect.
Computer Room 3 (CR3) of the ACF has recently been the focus of a great deal of activity, with the supply cabling to all of the main power distribution units (PDUs) and the sub-floor power supply cables being replaced. HPE has also been cleaning the insides of the ARCHER2 cooling distribution units (CDUs) and pipework.
The need for the PDU cabling upgrade arose from the regulatory five-year inspection and testing, which showed that, although adequate for the original ARCHER machine, the electrical infrastructure was being pushed to its operating limit when ARCHER2 was at maximum capability.
The simplest way to accomplish this work would have been to switch off ARCHER2 for two months while the cables were replaced. As this was obviously not possible, the system power supplies had to be maintained throughout the works to ensure ARCHER2 remained operational. In total we had to install around 3 km of large-diameter power cables without disrupting any services.
Working with our electrical contractor and HPE we devised a way to connect enough 125 amp 3-Phase temporary cables to power ARCHER2 compute cabinets while we isolated and worked on each PDU in turn to install and connect the new cabling. Although the work was difficult to programme and implement, the supply cables were all installed and connected with zero disruption to users of ARCHER2.
ARCHER2’s cooling and management cabinet supplies, which are connected to UPS power, were an additional consideration. To keep these systems operational, each of the UPS-fed PDUs was back-fed from one of the other UPS PDUs to ensure these more critical circuits were not interrupted during the works. This was not a simple task, but these PDUs have now been fully rewired to meet present and future needs.
All these works were completed with no disruption to ARCHER2 service provision thanks to great coordination and liaison between our electrical contractor, HPE, and the ACF site team.
Another electrical issue which came to light at this time was that the 125 amp 3-Phase plug/socket connections to the ARCHER2 compute cabinets were susceptible to overheating under heavy load, a situation which has also been seen in other locations worldwide.
To address this we decided not only to replace the plugs with hard, bolted connections (all 69 of them), but also to fully rewire all the sub-floor cabling for ultimate safety of supply. This presented the same challenge again: to complete the works without disrupting the service.
Working with the contractor, HPE, and ACF site team we devised a system to provide and connect temporary supplies to the compute cabinets with “hot swaps” of these critical connections, which allowed the existing cables to be removed and replaced, and the cabinets to be connected to the new cabling, without any downtime. This approach proved to be acceptable to all stakeholders and was again completed with no loss of ARCHER2 service, which was a great achievement for all concerned. The electrical contractor’s staff deserve everyone’s thanks for working in very cramped underfloor conditions for several months.
While these electrical works were progressing in CR3, HPE identified an issue with ARCHER2’s cooling distribution units (CDUs) and their connected cooling pipework to the compute cabinets. This issue has also been seen in similar systems around the world. It was therefore decided to clean the insides of all this pipework.
To minimise disruption to the ARCHER2 service, it was agreed that four compute cabinets (a sixth of the total system) would be taken out of service at a time for remedial work, thus allowing work to proceed while ensuring the system remained operational, albeit at a reduced capability. Users may have noticed parts of the system being taken out of service and returned a few days later.
This work has been carried out by HPE specialists flown in from the USA and Europe, supported by the site-based HPE team of Martin, Felipe, and Greg.
The works are now generally finished with only some final cleansing ongoing, which is expected to be completed in the coming weeks. The ARCHER2 service will then be back to full capability, and we will be able to consider new High Performance Linpack benchmark runs to verify its maximum performance now that we have more experience with the system.
While all of this has been ongoing, we’ve also swapped out our Cerebras CS-1 for a new CS-2, racked an additional CS-2 for future use, ordered several more HPE ARCS cabinets for near-future installation, started planning for a potential expansion of the DiRAC Tursa system, upgraded our site management network, and continued planning and design for Exascale. We have also kept up all of the other daily tasks that ensure the ACF functions smoothly, while looking forward to whatever the future throws at us.
About the ACF
The Advanced Computing Facility is EPCC’s high performance computing data centre. It is home to the ARCHER2 national supercomputing service, the Edinburgh International Data Facility, and other systems of national importance. Each of the ACF’s four computer rooms hosts specific HPC and storage equipment and is supported by associated plant rooms which provide dedicated power and cooling infrastructure for each room.
Images show: 1. hot and cold cooling connections to the rear of ARCHER2 compute cabinets; 2. some of the removed 125 amp 3-Phase plugs.