Causing a Storm in MPI: easier data processing for scientists
Posted: 17 Jun 2014 | 15:00
After several years of working with users who are not computer scientists (seismologists and geoscientists), we have realised two main points: these communities usually have problems that should be addressed with parallel computing, but they don't often have the skills and training to do so. We set out to build a programming library, Dispel4Py, that both enables users to easily write a description of a data-processing application and takes care of running that application in different parallel environments.
We decided to implement our library in Python as it is very popular with the seismologists we are working with in the VERCE project, which is building a forward-modelling application that uses specfem3d to produce geological models on HPC platforms. The result files are post-processed with a Python script, for example to generate image files or videos. With Dispel4Py we aim to enable users to write the postprocessing chain and run it in different environments without changing their code.
So far Dispel4Py has three different parallel mappings: Storm, MPI and multithreading. This means that users can design their applications independently of the parallel engine selected for running them, because Dispel4py maps to the parallel engine dynamically at runtime. Therefore, Dispel4Py enables the user to run an application on a Storm and MPI cluster without learning their languages. Instead, users build their applications from "Lego bricks" of tasks that they can assemble into workflows as they wish.
When we designed Dispel4py, our first choice for the parallel engine to run the applications was Storm. Apache Storm is a distributed realtime computation system that reliably processes unbounded streams of data. It is a dedicated system that executes graphs of computational nodes with data streaming between them. However, while Storm is very popular for data-intensive applications in business environments, not many HPC centres support it. Therefore we decided to also support MPI, which is well known and widely supported in HPC environments. Finally, there are many cases where users prefer to use their desktops for testing their applications before running them on the cluster. Nowadays, desktops are designed with many cores, so we added support for multithreading for testing on desktops too.
To execute a Dispel4Py graph with Storm, it is translated into a Storm topology using a straightforward mapping. Storm manages the parallelisation itself. However to support the same in an MPI environment, we implemented the parallel design used in Storm for MPI. This means that now Dispel4Py is also able to automatically translate a graph to an MPI application, without users having to know MPI.
Scientists have always shared data and mathematical methods and over the last two decades seismology and geosciences, as well as other solid-Earth sciences, have increasingly used Internet communications for this purpose. Therfore, we considered how to open up opportunities for sharing and comparing workflow components. In Dispel4Py there is a feature called 'Registry' to collect and share commonly used workflow components. It allows users to import these shared "Lego bricks" transparently into their code, using their favourite Python programming environment, and combine them when building their own workflows.
Dispel4Py and VERCE
Dispel4Py is being developed as part of VERCE. This project is creating a data-intensive e-science environment to enable innovative data analysis and data-modelling methods that fully exploit the increasing wealth of open data generated by the observational and monitoring systems of the global seismology community. It is a Framework 7 project.
Image: Don, Flickr