Data decomposition reshape library
Posted: 10 May 2013 | 10:00
In my previous blog post I said that I was working on a library to move data between different data decompositions.
In many cases it is easier for a programmer to work with a global coordinate system that reflects the overall data in the program. This is the approach taken by many PGAS languages and some parallel libraries such as BLACS.
The programmer still wants control over the data decomposition, but ideally this should be a separate concern that can be changed without forcing a complete rewrite of the rest of the program.
In general any data decomposition can be represented at run-time by an object with the following interface:
- A method that maps global coordinates to a process rank.
- A method that maps global coordinates to a local memory offset.
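As a concrete illustration, here is a minimal sketch of that two-method interface for a simple 1-D block decomposition. The class and method names are mine, not the library's actual API:

```python
class BlockDecomposition1D:
    """Sketch of the decomposition interface: block-distribute `n`
    global indices over `nprocs` process ranks."""

    def __init__(self, n, nprocs):
        self.n = n
        self.nprocs = nprocs
        # Ceiling division gives the block size owned by each rank.
        self.block = -(-n // nprocs)

    def rank_of(self, i):
        """Map a global index to the rank that owns it."""
        return i // self.block

    def offset_of(self, i):
        """Map a global index to a local memory offset on its owner."""
        return i % self.block
```

Other distributions (cyclic, block-cyclic, irregular) would provide different implementations of the same two methods.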
To start with, I'm only considering decompositions where each dimension of the data set is decomposed independently. I can therefore represent the decomposition along each dimension as a separate object and combine them using a set of processor-rank and memory-offset stride values. This can still represent a far more general set of decompositions than most parallel libraries support. The downside is that these general decompositions are a little more expensive to use.
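The stride-combination idea can be sketched as follows: each dimension maps its own coordinate independently, and the per-dimension results are combined as weighted sums. All names here are illustrative assumptions, not the library's API:

```python
class BlockDim:
    """Minimal per-dimension block decomposition for this sketch."""

    def __init__(self, n, nprocs):
        self.block = -(-n // nprocs)  # ceiling division

    def rank_of(self, i):
        return i // self.block

    def offset_of(self, i):
        return i % self.block


def map_global(coords, dims, rank_strides, offset_strides):
    """Combine independent per-dimension mappings into a full
    (process rank, local memory offset) pair via stride sums."""
    rank = sum(d.rank_of(c) * s
               for c, d, s in zip(coords, dims, rank_strides))
    offset = sum(d.offset_of(c) * s
                 for c, d, s in zip(coords, dims, offset_strides))
    return rank, offset


# Example: an 8x8 array on a 2x2 process grid, row-major local storage.
dims = [BlockDim(8, 2), BlockDim(8, 2)]
rank, offset = map_global((5, 2), dims,
                          rank_strides=(2, 1),     # 2x2 process grid
                          offset_strides=(4, 1))   # 4x4 local blocks
```

The rank strides encode the process grid layout and the offset strides encode the local array layout, so the same per-dimension objects can be reused under different layouts.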
To mitigate this, I use an interface which uses decomposition descriptors to build reusable communication plans for switching between data decompositions (essentially these are lists of MPI datatypes corresponding to the necessary messages). Any additional overhead only takes place in the initial planning stage and should have little impact on the overall performance of the code. As an added bonus, virtually the same code can be used to build MPI-IO file-view datatypes to support parallel IO to the different decompositions.
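The planning idea can be shown conceptually in pure Python, with no MPI: given source and destination decompositions, enumerate once which global indices must move between each pair of ranks. In the real library each of these index lists would be encoded as an MPI derived datatype; the function below is only an assumed illustration of the idea:

```python
from collections import defaultdict


def build_plan(n, src_rank_of, dst_rank_of):
    """Precompute a communication plan for reshaping `n` global indices:
    map each (source rank, destination rank) pair to the list of global
    indices that must be sent between them. Built once, reused for every
    subsequent reshape."""
    plan = defaultdict(list)
    for i in range(n):
        plan[(src_rank_of(i), dst_rank_of(i))].append(i)
    return dict(plan)


# Example: reshape 8 elements from a block distribution (blocks of 4
# over 2 ranks) to a cyclic distribution over the same 2 ranks.
plan = build_plan(8,
                  src_rank_of=lambda i: i // 4,   # block
                  dst_rank_of=lambda i: i % 2)    # cyclic
```

Because the expensive enumeration happens only at plan-construction time, each actual reshape reduces to posting the precomputed messages.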
I'm building this library to make it easier to explore new algorithms for distributed FFTs. My first test was therefore to reproduce the capabilities of the distributed FFT provided by the FFTW-3 library on HECToR. The parallel FFTW library only supports decomposition in one dimension, so it is relatively easy to reproduce. The following graph shows the time to solution for two different sizes of 3D FFT. These tests were run with one MPI task per node but with multi-threading enabled within the local FFTs.
As can be seen from the graph, except at small node counts my code is already performing better than the parallel FFTW library.