INTERTWinE project presented at collaboration workshop in Japan

23 January 2018

Last month I attended a collaboration workshop in Japan between the Centre for Computational Sciences (CCS) at the University of Tsukuba and EPCC. I spoke about the INTERTWinE project, which addresses the problem of programming-model design and implementation for the Exascale, and specifically about our work on the specification and implementation of a resource manager and a directory/cache.

The resource manager enables different runtimes to work together and share resources fairly. For instance, a code might wish to take advantage of multiple programming technologies, either by explicitly combining them in the user's code or implicitly by calling a library which makes its own assumptions about the resources available. The danger is that resources become oversubscribed, for example when too many threads are created, and this results in a loss of performance. The resource manager therefore marshals resources at a high level: it not only distributes them statically, but also supports dynamic manipulation, such as a runtime lending resources to other processes while it is idle.
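
To make this concrete, here is a minimal sketch in C of the kind of bookkeeping involved. All of the names (rm_client, rm_acquire_cores, rm_lend_cores) are hypothetical illustrations rather than the actual INTERTWinE resource manager API: runtimes acquire cores from a shared pool, the pool is never oversubscribed, and an idle runtime can lend cores back for another runtime to pick up.

```c
/* Hypothetical sketch of resource-manager coordination; these names
 * are illustrative, not the INTERTWinE API. */
#include <stdio.h>

#define TOTAL_CORES 16

/* A runtime registered with the manager and the cores it holds. */
typedef struct { const char *name; int cores_held; } rm_client;

static int cores_free = TOTAL_CORES;

/* Grant up to 'wanted' cores, never oversubscribing the node. */
static int rm_acquire_cores(rm_client *c, int wanted) {
    int granted = wanted < cores_free ? wanted : cores_free;
    cores_free -= granted;
    c->cores_held += granted;
    return granted;
}

/* An idle runtime hands cores back so others can use them. */
static void rm_lend_cores(rm_client *c, int count) {
    if (count > c->cores_held) count = c->cores_held;
    c->cores_held -= count;
    cores_free += count;
}

int main(void) {
    rm_client omp = { "OpenMP", 0 }, tasks = { "task runtime", 0 };
    rm_acquire_cores(&omp, 12);   /* first runtime starts up          */
    rm_acquire_cores(&tasks, 12); /* only 4 left: no oversubscription */
    printf("%s holds %d, %s holds %d\n",
           omp.name, omp.cores_held, tasks.name, tasks.cores_held);
    rm_lend_cores(&omp, 8);       /* first runtime goes idle          */
    rm_acquire_cores(&tasks, 8);  /* second runtime picks cores up    */
    printf("after lending: %s holds %d, %s holds %d\n",
           omp.name, omp.cores_held, tasks.name, tasks.cores_held);
    return 0;
}
```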

The directory/cache is designed to support the execution of tasks over distributed-memory machines in a way that is transparent to the programmer. Traditionally, task-based models have been limited to a single memory space, but this is really an implementation challenge in the runtime rather than a fundamental limitation of the paradigm. Efficiently moving data between nodes is the critical challenge here: the directory tracks which data is physically allocated in which memory space, so that reads and writes can be serviced either locally or remotely. The cache is an optimisation; the idea is that a specific piece of data might be used frequently, so ideally we issue communications only once to retrieve it and keep a local copy for as long as possible. As I say, this is entirely transparent to the programmer, as the directory/cache is intended to be integrated with the runtime. Our reference implementation also contains multiple transport mechanisms, meaning that GPI-2, MPI RMA and BeeGFS can be swapped in and out trivially as the technology for physically moving the data around.
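
As a toy illustration of the idea (not the reference implementation itself), the sketch below keeps a directory recording which rank owns each block of data, and a cache that stores a local copy after the first remote read so repeated accesses issue no further communication. The remote_fetch function is a stand-in for whichever transport (GPI-2, MPI RMA or BeeGFS) happens to be configured.

```c
/* Toy sketch of the directory/cache concept; all names illustrative. */
#include <stdio.h>
#include <string.h>

#define NBLOCKS 4
#define BLOCK_BYTES 8

/* Directory entry: where does each block physically live? */
typedef struct { int owner_rank; } dir_entry;

/* Cache entry: a local copy, valid after the first fetch. */
typedef struct { int valid; char bytes[BLOCK_BYTES]; } cache_entry;

static dir_entry   directory[NBLOCKS] = { {0}, {1}, {1}, {2} };
static cache_entry cache[NBLOCKS];

/* Stand-in for the transport layer (GPI-2, MPI RMA, BeeGFS, ...). */
static void remote_fetch(int block, char *out) {
    printf("fetching block %d from rank %d\n",
           block, directory[block].owner_rank);
    memset(out, 'A' + block, BLOCK_BYTES);
}

/* Read a block: consult the cache first, fall back to the directory. */
static const char *read_block(int block) {
    if (!cache[block].valid) {
        remote_fetch(block, cache[block].bytes);
        cache[block].valid = 1;
    }
    return cache[block].bytes;
}

int main(void) {
    read_block(2);  /* first access: one communication issued */
    read_block(2);  /* second access: served from local cache */
    return 0;
}
```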

It was really interesting to hear about the projects that staff at CCS are currently engaged in: from optimising FFTs to modelling pollution in cities, there is a real variety of research being conducted in Tsukuba. There are potential overlaps with our own work, and the idea of recasting problems in the task-based paradigm, without the traditional limitation to a single address space, is attractive.

For my own part, I am motivated by our work on an in-situ data analytics code (presented at CUG 2017). Whilst we built this data analytics on top of a bespoke active messaging layer, fundamentally we were after the benefits of a task-based model, but looking to run those tasks in a distributed-memory environment. I have realised that the developments made in INTERTWinE, and specifically the technologies of WP4, mean that it should now be possible to replace this active messaging layer with a task-based model (such as OmpSs). This would bring a number of advantages, from relying on a more mature technology to potential performance and scalability improvements.
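
To give a flavour of what this might look like, here is a small OmpSs-style sketch. The analytics kernels and their names are hypothetical placeholders: the point is that the in/out clauses declare data dependencies, from which the runtime derives the task graph that our bespoke active messaging layer previously had to express by hand.

```c
/* OmpSs-style sketch of in-situ analytics as dependent tasks.
 * The kernels (filter, reduce) are hypothetical placeholders. */
#include <stdio.h>

#define N 1024

/* First analytics stage on a block of simulation output. */
#pragma omp task in(raw[0;N]) out(filtered[0;N])
void filter(const double *raw, double *filtered) {
    for (int i = 0; i < N; i++)
        filtered[i] = raw[i] > 0.0 ? raw[i] : 0.0;
}

/* Second stage, which the runtime schedules after filter
 * because of the dependency on 'filtered'. */
#pragma omp task in(filtered[0;N]) out(*result)
void reduce(const double *filtered, double *result) {
    double sum = 0.0;
    for (int i = 0; i < N; i++) sum += filtered[i];
    *result = sum;
}

int main(void) {
    static double raw[N], filtered[N];
    double result = 0.0;
    for (int i = 0; i < N; i++) raw[i] = (i % 2) ? 1.0 : -1.0;
    /* Each call spawns a task; no explicit messaging is written. */
    filter(raw, filtered);
    reduce(filtered, &result);
    #pragma omp taskwait
    printf("sum = %f\n", result);
    return 0;
}
```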

We concluded the workshop by discussing next steps and how we might concretely work together on research projects to push forward the state of the art in HPC and HPDA. The next collaboration workshop between the two organisations will take place in December 2018, when we will be able to highlight some of the fruits of this collaboration.