Launching applications non-homogeneously on Cray supercomputers

Author: Tom Edwards
Posted: 12 Apr 2013 | 11:31

The vast majority of applications running on HECToR today are designed around the Single Program Multiple Data (SPMD) parallel programming paradigm. Each processing element (PE), i.e. MPI rank or Fortran Coarray image, runs the same program and performs the same operations in parallel on the same or a similar amount of data. Usually these applications are launched on the compute nodes homogeneously, with the same number of processes spawned on each node, each with the same number of threads (if required). Running applications in this symmetrical, “load-balanced” pattern is typically vital to maximising scaling performance on large, uniform MPP supercomputers like a Cray XE6.

However, the reality is that many applications are not true SPMD programs; instead they are, at least in part, “Multiple Program Multiple Data” (MPMD) programs. For example, some perform certain actions only on a subset of processes. This is potentially sub-optimal for users launching applications homogeneously across the nodes, as decisions like how many PEs to place on each node are determined for all processes by the PEs with the greatest resource requirements. This can waste resources, as the majority of PEs may require far less. A common example is a small number of PEs that require more memory than the others because they collect or distribute all the data for the whole application. Users may be forced to launch their application with fewer PEs per node than the majority of PEs need, just to accommodate the additional memory requirements of this small subset.

But help is at hand! Cray systems like HECToR provide a simple, flexible way to launch applications with a non-homogeneous distribution of PEs across nodes. It requires no changes to the application itself, and users who know about pre-existing, static load imbalance in their applications can design custom distributions that make the best use of the available resources.

Aprun’s MPMD mode to the rescue

All this depends on aprun's MPMD mode, which allows multiple binaries to be launched as a single parallel job, all linked together under the same global communicator (e.g. MPI_COMM_WORLD in MPI). This is done by stringing multiple sets of aprun launch options together in a single aprun command, separating them with a colon (full details are in the aprun man page), e.g.

aprun -n 16 -N 16 -d 2 binary1.exe :\
      -n 16 -N 16 -d 2 binary2.exe :\
      -n 32 -N 16 -d 2 binary3.exe

One important property of this mode is that each group is launched on its own unique set of compute nodes. However, there is no requirement that the binaries be different for each group, or that arguments like PEs per node (-N), thread depth (-d) or strict memory containment (-ss) be the same across groups.

PE IDs (i.e. MPI rank numbers or coarray image IDs) are assigned by evaluating the command left to right, so the first PE IDs (0, 1, 2, ... by default) come from the left-most process group, and so on. This means an application can be launched with a single binary but with a non-homogeneous distribution of PEs across the nodes, e.g. launching the first two PEs on a single node, giving each of them the memory of a whole socket, while the remaining PEs are packed more densely on the other nodes:

|------------- PEs 0-1 ----------------|
aprun -n 2 -N 2 -cc 0,15 -d 2 bin.exe :\
      |------- PEs 2-255 --------|
      -n 254 -N 16 -d 2 bin.exe

or alternatively, doing the same for PEs 127 and 128:

|----------- PEs 0-126 -----------|
aprun -n 127 -N 16 -d 2 bin.exe :\
      |------- PEs 127-128 --------|
      -n 2 -N 2 -cc 0,15 -d 2 bin.exe :\
      |------ PEs 129-255 -------|
      -n 127 -N 16 -d 2 bin.exe

Even greater flexibility with rank reordering

Initially this seems like it might only be useful when there are sequences of contiguous PEs with different resource requirements, and it may seem wasteful to run only one or two PEs on a whole node (perhaps some PEs need just twice the resources of the others, not an entirely dedicated node). The technique becomes really useful when combined with custom rank order files.

Rank reorder mapping allows users to define how individual PEs are mapped to the allocated nodes. This is usually used to maximise the volume of communication between PEs on the same node and minimise the traffic between PEs on different nodes. Combining this ability with non-homogeneous launching means almost any placement on nodes can be constructed.

For example, maybe two PEs are required to have a socket each as above, but their IDs are not consecutive (in this example PEs 0 and 128). By including a specially constructed MPICH_RANK_ORDER file and instructing the MPI library to use it, it is possible to place these two PEs on each of the sockets of the same node while still distributing the rest of the PEs normally. E.g.

$ cat <<EOF > MPICH_RANK_ORDER
0,128,1-127,129-255
EOF

And in your batch script add the line:

export MPICH_RANK_REORDER_METHOD=3

Resulting in:

|----------- PEs 0,128 ----------------|
aprun -n 2 -N 2 -cc 0,15 -d 2 bin.exe :\
      |--- PEs 1-127,129-255 ----|
      -n 254 -N 16 -d 2 bin.exe

Matching with Batch

One caveat is that this technique breaks the intuitive link between the way resources are requested from the batch system and the values used for the aprun command.

The PBS batch system on HECToR only allocates whole nodes at a time. Once a job is scheduled to run, it is assigned a set of nodes on which aprun is authorised to launch applications. PBS calculates the number of nodes required from two figures: the total number of PEs requested, mppwidth, and the number of PEs per node, mppnppn.
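
For example, a request of mppwidth=544 with mppnppn=32 corresponds to 544/32 = 17 whole 32-core nodes.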

It is very easy to accidentally request a different number of nodes from the batch system than the aprun command line requires. The recommended technique is to calculate how many nodes each MPMD section supplied to aprun requires, then sum these values to arrive at the total number of nodes required. Jobs should then request that number from the PBS batch system by setting mppwidth=<nodes>*32 and mppnppn=32 on a Cray XE6 with 32-core Interlagos nodes.
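
For instance, the reordered example above needs 1 node for the first section (2 PEs at 2 per node) plus ceil(254/16) = 16 nodes for the second section, i.e. 17 nodes in total, and hence mppwidth=17*32=544. A minimal sketch of a matching PBS job script is shown below; the walltime and the binary name bin.exe are placeholders (and any account/budget options required on your system are omitted), so adjust them to your own job:

#!/bin/bash
#PBS -l mppwidth=544
#PBS -l mppnppn=32
#PBS -l walltime=00:20:00

cd $PBS_O_WORKDIR

# Use the custom rank order file created earlier
export MPICH_RANK_REORDER_METHOD=3

# PEs 0 and 128 share the first node; the rest are packed 16 per node
aprun -n 2 -N 2 -cc 0,15 -d 2 bin.exe :\
      -n 254 -N 16 -d 2 bin.exe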

Example Use Cases

This powerful combination of techniques opens up a wide range of possibilities for application placement. Users can create “resource islands”, where PEs requiring different resources are placed on their own nodes (useful for IO aggregators or IO servers that buffer data and generate large volumes of IO traffic). These islands can be spread out at intervals throughout the job to ensure the communication they generate is distributed across the network.
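
As a hypothetical sketch of such a layout (the choice of PEs 0, 64, 128 and 192 as IO aggregators, one per dedicated node, is an assumption purely for illustration), a rank order file of

$ cat <<EOF > MPICH_RANK_ORDER
0,64,128,192,1-63,65-127,129-191,193-255
EOF

combined with MPICH_RANK_REORDER_METHOD=3 and the launch command

|--- PEs 0,64,128,192 ---|
aprun -n 4 -N 1 -d 2 bin.exe :\
      |-- PEs 1-63,65-127,129-191,193-255 --|
      -n 252 -N 16 -d 2 bin.exe

gives each aggregator a node of its own while the remaining 252 PEs are packed 16 per node.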

It is also possible with this technique to give different groups of PEs different values of OMP_NUM_THREADS at initialisation, either by launching wrapper scripts that set different values, e.g.

aprun -n 16 -N 16 -d 2 omp2.sh :\
      -n 32 -N 32 -d 1 omp1.sh

or using the env command,

aprun -n 16 -N 16 -d 2 env OMP_NUM_THREADS=2 a.out :\
      -n 32 -N 32 -d 1 env OMP_NUM_THREADS=1 a.out
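
A wrapper script such as omp2.sh in the first variant can be very simple; a minimal sketch, assuming the real application binary is called a.out and sits in the job's working directory, might be:

#!/bin/bash
# omp2.sh - set the OpenMP thread count for this group of PEs,
# then exec the real application so it replaces the wrapper process.
export OMP_NUM_THREADS=2
exec ./a.out "$@"

Remember to make the wrapper executable (chmod +x omp2.sh) before submitting the job.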

Varying the thread count per group in this way could be used to mitigate static load imbalance that is known at the start of the run, for example by assigning more threads to PEs that are known to take longer and fewer to those that run more quickly.

Even where there is no resource requirement for moving processes between nodes, it may be useful to reorder processes to achieve a closer mapping between the PEs on the nodes and the application’s parallel decomposition.

Contact

Tom Edwards, Cray Centre of Excellence for HECToR