Meet the ARCHER2 HPC Systems team

Author: Kieran Leach
Posted: 14 Apr 2021 | 10:46

The HPC Systems Team provides the System Development and System Operations functions for ARCHER2 - but who are we and what do we do?

We are a team of fifteen System Administrators and Developers who work to deploy, manage, and maintain the services and systems offered by EPCC, as well as the infrastructure required to host and support all of EPCC’s services and systems.

One of our main responsibilities is managing the Advanced Computing Facility (ACF) where ARCHER2 is hosted - we do this in coordination with our colleagues at the University of Edinburgh Estates department. It takes a lot of cooling and power to support a system such as ARCHER2 and a lot of effort goes into making sure this is properly provided.

Beyond providing accommodation for ARCHER2 we also play a key role in both the development and deployment of the service, as well as its day-to-day operation. This broadly falls into three areas - infrastructure, deployment, and operations.

On the infrastructure side, we provide a number of services that allow the ARCHER2 system to operate. These include NTP servers, which ensure the accuracy of the system clocks; authentication servers, which store and provide details of the various users on the system; and various servers that monitor and record the state of the service. We also provide the networking to integrate the system into our site and connect it to the broader internet. This year we’re going to be moving to a new network which will offer ARCHER2 100 Gbit/s connectivity to the broader internet (about 1500 times the average UK broadband speed) and 200 Gbit/s to other systems and services hosted at the ACF.
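To put those network figures in perspective, here is a quick back-of-the-envelope sketch. It uses only the numbers quoted above (100 Gbit/s, 200 Gbit/s, and "about 1500 times"); the average broadband speed it prints is simply what those figures imply, not a measured value, and the function and variable names are made up for illustration.

```python
# Figures quoted in the post.
EXTERNAL_LINK_GBIT_S = 100   # ARCHER2 to the broader internet
INTERNAL_LINK_GBIT_S = 200   # ARCHER2 to other systems at the ACF
BROADBAND_MULTIPLE = 1500    # "about 1500 times the average UK broadband"


def implied_broadband_mbit_s(link_gbit_s, multiple):
    """Average broadband speed (Mbit/s) implied by a link speed and a multiple."""
    return link_gbit_s * 1000 / multiple


# Assumption: the "1500 times" comparison refers to the 100 Gbit/s external link.
implied = implied_broadband_mbit_s(EXTERNAL_LINK_GBIT_S, BROADBAND_MULTIPLE)

# The same comparison applied to the 200 Gbit/s internal link.
internal_multiple = INTERNAL_LINK_GBIT_S * 1000 / implied

print(f"Implied average broadband: {implied:.1f} Mbit/s")
print(f"Internal link is about {internal_multiple:.0f} times that")
```

In other words, the comparison assumes an average home connection of roughly 67 Mbit/s, and the 200 Gbit/s internal link works out to about 3000 times that.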

For the deployment of ARCHER2, HPE Cray provide the system in a “vanilla” state and we put in place customisations and configurations to best support our users. This includes deploying appropriate configuration to the system scheduler, implementing a variety of monitoring so that our site systems can keep track of the state of the system, deploying directory structures to various file systems, and implementing the ticketing infrastructure needed during operations.

Our final main area of responsibility is operations - a wide-ranging area that is critical to keeping the service running. During working hours a member of the team is on shift and responsible for monitoring the state of the system at all times so that we can deal with problems as they emerge. We’re responsible for processing tickets on the service - these tickets handle the creation of all user accounts, quotas and directories, as well as any changes to them. Tickets are created and issued by the SAFE system, which you’ll access via a web portal to create and manage your account. We also troubleshoot and investigate a lot of the problems reported to the service desk by users. Hopefully you won’t run into any, of course!

We’ve been working towards the launch of ARCHER2 and we’re excited for you to get your hands on everything we’ve been working on. We hope you are too!

Author

Kieran Leach, EPCC
