Software use on ARCHER2 - an initial look

Author: Andy Turner
Posted: 24 May 2021 | 11:11

Back in February, I reviewed the usage of different research software on ARCHER over a large period of its life. Now that we have just come to the end of the first month of charged use on the ARCHER2 4-cabinet system, I thought it would be interesting to take an initial look at how people are using the new system in terms of research software use and job sizes.

First, some numbers on the ARCHER2 4-cabinet system. The system is an HPE Cray EX supercomputer; this subset of the full ARCHER2 system has 1024 compute nodes, each of which has two AMD EPYC 7742 64-core processors, giving 128 compute cores per node. Each node has 256 GiB of RAM and two 100 Gbps network interfaces (one per socket) onto the HPE Cray Slingshot interconnect. This gives an overall count of 131,072 compute cores, slightly more than the 118,080 compute cores available on the full ARCHER system but contained in far fewer nodes (1024 on the ARCHER2 4-cabinet system vs 4920 on ARCHER). As the minimum unit of request on ARCHER2 is a full compute node, the 4-cabinet ARCHER2 system actually has far less capacity than ARCHER in terms of the number of jobs it can run at once. This will be rectified when the full ARCHER2 system is available, as it will have 5848 compute nodes, giving 748,544 compute cores in total.
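The core counts quoted above follow directly from the node counts. The short Python sketch below reproduces that arithmetic; the ARCHER cores-per-node figure is not stated in this post and is simply derived here from the quoted totals, and the variable names are illustrative only.

```python
# Figures quoted in the text: two 64-core processors per ARCHER2 node.
ARCHER2_CORES_PER_NODE = 2 * 64


def total_cores(nodes: int, cores_per_node: int) -> int:
    """Total compute cores for a system made of identical nodes."""
    return nodes * cores_per_node


archer2_4cab_cores = total_cores(1024, ARCHER2_CORES_PER_NODE)
archer2_full_cores = total_cores(5848, ARCHER2_CORES_PER_NODE)

# Derived from the quoted ARCHER totals (118,080 cores in 4920 nodes).
archer_cores_per_node = 118_080 // 4920

# With whole-node allocation, the node count bounds the number of concurrent jobs.
node_ratio = 4920 / 1024

print(f"ARCHER2 4-cabinet cores: {archer2_4cab_cores:,}")
print(f"ARCHER2 full system cores: {archer2_full_cores:,}")
print(f"ARCHER cores per node (derived): {archer_cores_per_node}")
print(f"ARCHER has {node_ratio:.1f}x as many nodes as the 4-cabinet system")
```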

We used the data in the ARCHER2 Slurm database to analyse the usage of the system for the whole of April 2021. The executable names from jobs in Slurm were matched to research software using a library of regex patterns that we have built up over the lifetime of ARCHER and ARCHER2. The analysis tool and the regex patterns can be found at: GitHub repo to analyse ARCHER2 usage

A summary of the method

  • We extract all Slurm subjobs that completed within the specified period.
  • We match the job name field (for Slurm subjobs this usually contains the executable name) against known research software applications using a regex library.
  • We use Pandas to analyse the data and summarise the use of research software on ARCHER2.
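The matching step can be illustrated with a minimal sketch. It uses only the Python standard library rather than Pandas, and both the regex patterns and the subjob records are made up for illustration; the real pattern library lives in the analysis repository and will differ.

```python
import re
from collections import defaultdict

# Hypothetical patterns, for illustration only (not the real regex library).
PATTERNS = {
    "VASP": re.compile(r"^vasp", re.IGNORECASE),
    "cp2k": re.compile(r"^cp2k", re.IGNORECASE),
    "LAMMPS": re.compile(r"^lmp", re.IGNORECASE),
    "GROMACS": re.compile(r"^(gmx|mdrun)", re.IGNORECASE),
}


def identify(job_name: str) -> str:
    """Return the first software whose pattern matches the subjob name."""
    for code, pattern in PATTERNS.items():
        if pattern.search(job_name):
            return code
    return "Unidentified"


# Made-up subjob records: (job name, cores, CU used).
subjobs = [
    ("vasp_std", 256, 10.0),
    ("cp2k.psmp", 1, 0.5),
    ("lmp_mpi", 512, 4.0),
    ("gmx_mpi", 128, 2.0),
    ("a.out", 64, 1.5),
    ("vasp_gam", 384, 6.0),
]

# Sum CU use per identified software.
usage = defaultdict(float)
for name, cores, cu in subjobs:
    usage[identify(name)] += cu

for code, cu in sorted(usage.items(), key=lambda item: item[1], reverse=True):
    print(f"{code:<14}{cu:8.1f}")
```

Anything that matches no pattern falls into an "Unidentified" bucket, which is how such usage appears in the full table at the end of this post.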

The distribution of use of ARCHER2 by research software for April 2021 is given in the plot below:  

 

 

The research software applications generally match those seen on ARCHER, though the ordering is slightly different. Due to the short usage period analysed here, we do not yet have enough data to say whether this is a real difference. Research software applications in the top 15 in terms of use that are common with the top 15 on ARCHER are: cp2k, VASP, LAMMPS, CASTEP, GROMACS, Met Office Unified Model, MITgcm, NEMO, Quantum Espresso and NAMD. Entries that appear in the ARCHER2 top 15 but not in the ARCHER top 15 are: PDNS3D, HANDE, CESM, incompact3d and AxiSEM3D. We will report back on this picture once we have data for a longer period of use.

We have also looked at the job size distribution. The overall job size distribution in cores (weighted by job CU use) is shown in the boxplot below, with the corresponding numerical values in the table. White circles indicate the mean job size, but the mean is not generally a useful statistic here.

 

 

 

Code               Min       Q1   Median       Q3    Max    Jobs        CU
Overall              1     48.0    512.0   1408.0  35328  325216  768873.8
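To make "weighted by job CU use" concrete, here is a minimal sketch of a weighted quantile: each job size counts in proportion to the CU it consumed rather than once per job. This is one common definition and an assumption on my part; the analysis script may compute its quantiles differently, and the data below is made up.

```python
import statistics


def weighted_quantile(sizes, weights, q):
    """Smallest size at which the cumulative weight reaches fraction q of the total."""
    pairs = sorted(zip(sizes, weights))
    threshold = q * sum(weight for _, weight in pairs)
    cumulative = 0.0
    for size, weight in pairs:
        cumulative += weight
        if cumulative >= threshold:
            return size
    return pairs[-1][0]


# Made-up jobs: many tiny jobs using little CU, a few large jobs using most of it.
job_cores = [1, 1, 1, 128, 512, 1024]
job_cu = [0.1, 0.1, 0.1, 2.0, 8.0, 10.0]

unweighted_median = statistics.median(job_cores)
weighted_median = weighted_quantile(job_cores, job_cu, 0.5)
weighted_q3 = weighted_quantile(job_cores, job_cu, 0.75)

print(f"Median by number of jobs: {unweighted_median}")
print(f"Median weighted by CU:    {weighted_median}")
print(f"Q3 weighted by CU:        {weighted_q3}")
```

Weighting by CU answers "at what job size is the resource being spent?" rather than "what size is a typical job?", which is why the two medians can differ widely.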

We have also broken down the job size statistics (in cores) for the top 15 research software applications used on ARCHER2 in the period. The following plot and table provide details on these job sizes.

Code               Min       Q1   Median       Q3    Max    Jobs        CU   % CU
cp2k                 1      1.0      1.0     64.0   4096   23677  237609.3   30.9
VASP                 1    256.0    384.0   1408.0  32768   64231  121166.8   15.8
LAMMPS               1    512.0    512.0   1280.0   6400    9192   50634.0    6.6
CASTEP               1    128.0    256.0   2560.0  10240  132161   36330.5    4.7
Gromacs              1    128.0   1536.0   1536.0   4096   27196   31878.8    4.1
PDNS3D               1   3200.0   3200.0  15360.0  15360     121   20867.7    2.7
Quantum Espresso     1   1024.0   1024.0   1408.0   2048    8347   18406.5    2.4
NAMD                 1    640.0   3200.0   3200.0  16384    1481   18318.6    2.4
SENGA              432    576.0    729.0   1458.0   1458      78   16879.1    2.2
FHI aims             8    512.0   1280.0   2560.0   3840    4757   16766.1    2.2
Met Office UM        1   1152.0   1152.0   1152.0   6272     928   13473.1    1.8
OpenFOAM             2    512.0    512.0    640.0   1024     804   13215.2    1.7
CRYSTAL              1    512.0    512.0   1024.0   2560     442    9217.2    1.2
MITgcm               1    128.0    128.0    128.0    400     528    8843.1    1.2
Nektar++           128    512.0    768.0   1024.0   3840     348    8317.4    1.1
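The % CU column is each code's CU use as a share of the overall total. A quick sketch of that calculation, using CU values copied from the tables above (the helper name is illustrative only):

```python
# Overall CU total and per-code CU values, copied from the tables above.
TOTAL_CU = 768873.8
cu_by_code = {"cp2k": 237609.3, "VASP": 121166.8, "LAMMPS": 50634.0}


def percent_cu(cu: float, total: float = TOTAL_CU) -> float:
    """Share of total CU use, as a percentage rounded to one decimal place."""
    return round(100.0 * cu / total, 1)


shares = {code: percent_cu(cu) for code, cu in cu_by_code.items()}

for code, share in shares.items():
    print(f"{code:<8}{share:5.1f} %")
```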

 

Looking at this data, we can note a few trends:

  • Some research software uses a range of different job sizes whereas some is much more restricted to particular sizes. For example, at least half of the substantial cp2k usage is made up of a large number of single-core runs, presumably some kind of parameter sweep or statistical sampling workflow. This could indicate that, over this short period, the use of a particular piece of software is restricted to a single user with a particular use case.
  • The only code that is currently being used at large scale is the PDNS3D CFD code, with all jobs using at least 3,200 cores and the largest jobs using 15,360 cores.
  • 7 of the top 15 by usage are materials modelling research software (and they make up 4 of the top 5), 4 of the 15 are CFD, 2 are Earth systems modelling software and 2 are biomolecular simulation.

The restricted size of the current ARCHER2 system (only the initial 4 cabinets out of the full 23 are available at the time of writing) clearly limits the ability of users to scale up their calculations to higher core counts and still see reasonable throughput. In fact, the ability to run very large jobs (larger than 256 nodes) is currently restricted on the system to improve the throughput for the user community, leading to an effective cap on production job sizes at 256 nodes. We expect the user community to be able to make more effective use of larger scale calculations once the full ARCHER2 system is available.
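As a rough check of how the job sizes in the tables relate to the 256-node cap, core counts can be converted to whole nodes. The sketch below assumes every job occupies full 128-core nodes (the minimum unit of request on ARCHER2); the maximum job sizes are copied from the tables in this post.

```python
import math

CORES_PER_NODE = 128  # two 64-core processors per node
CAP_NODES = 256       # effective cap on production job sizes


def cores_to_nodes(cores: int) -> int:
    """Number of whole nodes needed for a job of the given core count."""
    return math.ceil(cores / CORES_PER_NODE)


cap_cores = CAP_NODES * CORES_PER_NODE

# Maximum job sizes (in cores) copied from the tables in this post.
max_job_cores = {"VASP": 32768, "PDNS3D": 15360, "Overall": 35328}

for code, cores in max_job_cores.items():
    nodes = cores_to_nodes(cores)
    print(f"{code:<8}{cores:>6} cores = {nodes:>3} nodes (within cap: {nodes <= CAP_NODES})")

print(f"Cap of {CAP_NODES} nodes = {cap_cores:,} cores")
```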

In a future blog post, we will report back on how research software use trends have evolved over the ARCHER2 service, as well as on trends associated with different research areas and programming language use on the service.

If you wish to analyse the data yourself, we provide links below to the dataset used for this analysis, along with a link to the Python script used to produce the plots and tables in this post. We also include the full summary table for all research software that we are currently able to identify through the analysis tool.

Full job size statistics

Job size distribution and usage in cores (weighted by job CU use).

Code               Min       Q1   Median       Q3    Max    Jobs        CU   % CU
Overall              1     48.0    512.0   1408.0  35328  325216  768873.8  100.0
cp2k                 1      1.0      1.0     64.0   4096   23677  237609.3   30.9
VASP                 1    256.0    384.0   1408.0  32768   64231  121166.8   15.8
Unidentified         1    192.0   1024.0   3584.0  35328   40261   90011.7   11.7
LAMMPS               1    512.0    512.0   1280.0   6400    9192   50634.0    6.6
CASTEP               1    128.0    256.0   2560.0  10240  132161   36330.5    4.7
Gromacs              1    128.0   1536.0   1536.0   4096   27196   31878.8    4.1
PDNS3D               1   3200.0   3200.0  15360.0  15360     121   20867.7    2.7
Quantum Espresso     1   1024.0   1024.0   1408.0   2048    8347   18406.5    2.4
NAMD                 1    640.0   3200.0   3200.0  16384    1481   18318.6    2.4
SENGA              432    576.0    729.0   1458.0   1458      78   16879.1    2.2
FHI aims             8    512.0   1280.0   2560.0   3840    4757   16766.1    2.2
Met Office UM        1   1152.0   1152.0   1152.0   6272     928   13473.1    1.8
OpenFOAM             2    512.0    512.0    640.0   1024     804   13215.2    1.7
CRYSTAL              1    512.0    512.0   1024.0   2560     442    9217.2    1.2
MITgcm               1    128.0    128.0    128.0    400     528    8843.1    1.2
Nektar++           128    512.0    768.0   1024.0   3840     348    8317.4    1.1
NEMO                 1    456.0    456.0   1792.0   4864    5400    7533.5    1.0
HANDE                1      4.0      4.0      4.0      4      59    7245.4    0.9
CESM                48     48.0     48.0     48.0    288     232    7163.8    0.9
AxiSEM3D            15   1280.0   1536.0   1920.0   4480     106    5208.4    0.7
incompact3d        640   2048.0   2048.0   2048.0   4096      47    4744.1    0.6
GS2                 64   1408.0   1408.0   1408.0   1408     143    4335.2    0.6
ONETEP               1     32.0     64.0     64.0   2048     317    3715.6    0.5
EPOCH              128   1920.0   2560.0   2560.0   2560      36    2826.2    0.4
Nek5000             72     72.0    288.0    576.0    864      46    2645.4    0.3
ABINIT               1    384.0    384.0    384.0    768     535    2386.9    0.3
Smilei               2     64.0    256.0    320.0   1024     116    2354.5    0.3
Code_Saturne       256    768.0   2048.0   3456.0  14080      43    1581.1    0.2
HemeLB             256    512.0    768.0    768.0   1536      17    1071.9    0.1
ChemShell            1    128.0    384.0    896.0   3072     177     887.8    0.1
USCNS3D              8     64.0     64.0    128.0    128      46     732.9    0.1
Python               1      1.0      1.0    256.0   4096    1745     704.5    0.1
RMT                  4    384.0    384.0    384.0   2688     258     491.1    0.1
Amber              128    128.0    128.0    128.0    128      28     265.3    0.0
WRF                 32     72.0    120.0    120.0    512      87     246.2    0.0
Fluidity          1000   1000.0   1000.0   1000.0   1000      19     227.1    0.0
OSIRIS             192    192.0    192.0    192.0    192       3     219.3    0.0
GPAW               128    128.0    128.0    256.0    384     299     116.3    0.0
PRECISE              1   1024.0   2048.0   2048.0   4096     227      93.3    0.0
HYDRA                1     18.0    256.0    512.0   1024     208      69.2    0.0
NWChem             128   1024.0   1024.0   1024.0   1024      27      34.8    0.0
TPLS                 1    128.0    128.0    128.0    256     425      19.1    0.0
ECHAM              256    256.0    256.0    256.0    256       7      14.4    0.0
iIMB               256    256.0    256.0    256.0    256       4       2.9    0.0
FVCOM                1    256.0    256.0    256.0    256       7       2.4    0.0

Datasets and analysis scripts

Raw dataset (group usage anonymised, gzipped CSV)
Usage and sizes weighted by CU use (CSV)
Usage and sizes by number of jobs (CSV)

Author

Andy Turner, EPCC
