Multi-network MPI on Intel Omni-Path
Posted: 17 Jul 2019 | 14:11
As part of the NEXTGenIO project we have a prototype HPC system that has two Intel Omni-Path networks attached to each node. The aim of having a dual-rail network setup for that system is to investigate the performance and functionality benefits of having separate networks for MPI communications and for I/O storage communications, either directing Lustre traffic and MPI traffic over separate networks, or using a separate network to access NVDIMMs over RDMA. We were also interested in the performance benefits for general applications exploiting multiple networks for MPI traffic, if and where possible.
However, our initial focus with the prototype was on experimenting with the Intel Optane DC Persistent Memory and the functionality/performance it provides, so we had been using a single network for everything up until now. To rectify this we powered up the second switch and configured the prototype to use both sets of network adapters on each node, but this required some configuration on the user side to enable MPI jobs to run successfully in this multi-rail environment.
The configuration issues weren't straightforward to fix, or at least the working fix was not straightforward to find, so I thought it worth documenting here for the (unlikely) case that someone else is struggling to set up a multi-rail Omni-Path network installation for MPI applications. When the two networks first came up, single-node MPI jobs ran fine (likely because they were using shared memory within the node for MPI communications), but multi-node jobs were failing with both Intel MPI and OpenMPI.
For Intel MPI we were seeing failures like this:
Abort(1014586895) on node 64 (rank 64 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-32766, key=64, new_comm=0x7ffd9954e924) failed
PMPI_Comm_split(489)...................:
MPIR_Comm_split_impl(167)..............:
MPIR_Allgather_impl(239)...............:
MPIR_Allgather_intra_auto(145).........: Failure during collective
MPIR_Allgather_intra_auto(141).........:
MPIR_Allgather_intra_brucks(115).......:
MPIC_Sendrecv(344).....................:
MPID_Isend(345)........................:
MPIDI_OFI_send_lightweight_request(110):
MPIDI_OFI_send_handler(704)............: OFI tagged inject failed (ofi_impl.h:704: MPIDI_OFI_send_handler:Connection timed out)
Abort(1094543) on node 20 (rank 20 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(649).......:
MPID_Init(863)..............:
MPIDI_NM_mpi_init_hook(1202): OFI get address vector map failed
For OpenMPI we saw errors like this:
PSM2 returned unhandled/unknown connect error: Operation timed out
PSM2 EP connect error (unknown connect error)
Both of these sets of errors are just reporting time-outs in various parts of the libraries, showing that MPI processes cannot establish communications with remote processes. I should note this is for MPI libraries that use libfabric, and ultimately the PSM2 library, for the underlying communication functionality.
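Before digging into MPI-level fixes, it's worth confirming that both adapters are actually visible to the node and to libfabric. A quick diagnostic sketch, assuming a standard Omni-Path software install (device paths and tool availability may differ on your system):

```shell
# Each Omni-Path adapter shows up as an hfi1 character device;
# a dual-rail node should list two of them.
ls /dev/hfi1_*          # expect something like /dev/hfi1_0 /dev/hfi1_1

# Check that libfabric can enumerate the PSM2 provider that
# Intel MPI and OpenMPI will ultimately sit on top of.
fi_info -p psm2
```

If `fi_info` reports no PSM2 provider, or only one hfi1 device appears, the problem is below the MPI library and no amount of MPI configuration will help.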
After some trawling through documentation and the web, and a bit of experimentation, the following set of environment variables ended up fixing the issues for us. For MPI jobs using a single rail, one of two settings was needed:
depending on which network is used (from a performance perspective it didn't matter which one was chosen). For jobs wanting to use both networks for MPI traffic, the following flags were required:
export PSM2_MULTIRAIL=1
export PSM2_MULTIRAIL_MAP=0:1,1:1
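For context, a job-script sketch showing how the two modes might look side by side. The single-rail variable isn't spelled out above, so I'm assuming PSM2's `HFI_UNIT` adapter-pinning variable here; the launcher flags and application name are likewise illustrative:

```shell
# Job-script sketch. HFI_UNIT as the single-rail selector is an
# assumption (it is PSM2's standard unit-pinning variable); mpirun
# arguments and the binary name are placeholders.

# --- Option 1: single rail - pin PSM2 traffic to one adapter ---
export HFI_UNIT=0            # or HFI_UNIT=1 for the other network

# --- Option 2: dual rail - stripe PSM2 traffic over both adapters ---
# (do not set HFI_UNIT in this case; the map is a list of unit:port
#  pairs, here unit 0 port 1 and unit 1 port 1)
export PSM2_MULTIRAIL=1
export PSM2_MULTIRAIL_MAP=0:1,1:1

mpirun -np 96 -ppn 48 ./my_mpi_app
```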
For our networks we saw double the MPI bandwidth for large message sizes when comparing single- and dual-rail benchmarks: from ~23GB/s to ~46GB/s for messages over 16KB, using the IMB BiBand benchmark across two nodes with 48 processes per node. There was no significant change in latency for point-to-point messages. Whether this translates into improved application performance really depends on the size of the messages the application sends with MPI. For the benchmark there is no appreciable improvement in message bandwidth until sending over 128 per MPI message; presumably that is the point at which more than one network packet is needed per message, and therefore the multiple networks can be exploited.
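The behaviour above matches what ideal striping over two identical rails predicts. A toy model, with the striping threshold and perfect linear scaling both being assumptions rather than measurements:

```python
# Toy model of rail striping: below some threshold a message stays on
# one rail and bandwidth is unchanged; above it, bandwidth scales with
# the number of rails. Both the threshold and the scaling are idealised.
def multirail_bandwidth(single_rail_gbs, n_rails, msg_bytes, threshold_bytes):
    if msg_bytes < threshold_bytes:
        return single_rail_gbs        # message too small to stripe
    return single_rail_gbs * n_rails  # ideal linear scaling

SINGLE_RAIL = 23        # ~GB/s, as measured with IMB BiBand on our system
THRESHOLD = 128 * 1024  # illustrative striping threshold, in bytes

print(multirail_bandwidth(SINGLE_RAIL, 2, 4 * 1024, THRESHOLD))     # small message -> 23
print(multirail_bandwidth(SINGLE_RAIL, 2, 1024 * 1024, THRESHOLD))  # large message -> 46
```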
The next stage will be routing Lustre traffic over a specific network and playing with some benchmarks to see the impact of I/O traffic on application communications. I'll post another article once we have some performance data on that.