Investigating high-performance data engineering

8 January 2018

Big data has always been a part of high-performance computing and the science it supports, but new open-source technologies are now being applied to a wider range of scientific and business problems. We’ve recently spent time testing some of these big data toolkits.

Two of the largest drivers of big data applications have been mobile applications and the Internet-of-Things (IoT).

Sensor data

Smartphones contain a remarkable array of sensors, making them very interesting devices for sensing and recording a user’s environment. The increasing prevalence of low-power sensors monitoring air and water quality, traffic density and so on gives us an opportunity to make better socio-economic decisions, informed by better data. And, whether from mobile phone or static sensor, all those data have to go somewhere. EPCC has been engineering back-end systems for mobile data streams in two important areas: disaster risk reduction and smarter cities.

Working with the Schools of Informatics and Geosciences, and with international NGO Concern Worldwide, we have created both the back-end and front-end for a system to record smartphone sensor data with the aim of predicting and improving earthquake resilience in buildings.

In related work we have been working with Edinburgh mobile app firm XDesign to gather and store sensor data from smartphones in moving vehicles.

The challenge in both cases is not creating an app to record data, but creating an operational back-end that can scale to many data streams from many devices.

In exploring these challenges we have been building expertise in Amazon Web Services’ building blocks for streaming data applications – AWS Lambda, API Gateway, DynamoDB – and exploring the use of Spark and Elastic MapReduce as platforms for onward analysis. We have learned that the AWS toolkit is functionally impressive, but that performance for both data gathering and data analysis still needs careful design and some technical finessing.
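To give a flavour of what this looks like in practice, here is a minimal sketch of the ingestion path: an AWS Lambda handler, sitting behind API Gateway, that writes incoming sensor readings into a DynamoDB table. The table name, key schema and payload fields are illustrative assumptions, not our production setup.

```python
# Minimal AWS Lambda handler for a sensor-data ingestion endpoint.
# Assumes an API Gateway proxy integration and a DynamoDB table called
# "SensorReadings" with partition key "device_id" and sort key "timestamp"
# (all illustrative, not our production schema).
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SensorReadings")

def handler(event, context):
    reading = json.loads(event["body"])
    table.put_item(Item={
        "device_id": reading["device_id"],      # which phone or sensor
        "timestamp": reading["timestamp"],      # ISO-8601 string from the device
        "payload": json.dumps(reading["data"])  # raw sensor values, kept as JSON text
    })
    return {"statusCode": 200, "body": json.dumps({"status": "stored"})}
```

The interesting engineering lies around this handler rather than inside it: batching, throttling, provisioned throughput on the table, and how the stored data are then fed into Spark or Elastic MapReduce for onward analysis.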

Machine learning frameworks

We’ve also been looking at the effectiveness of machine learning frameworks on both GPUs and CPUs.

Our main test case here was inspired by a commercial partner and involves identifying features from Earth observation image data (our target is the Sentinel satellite data from the ESA Copernicus programme). Image analysis is a classic machine learning example: throw a large number of images at an algorithm (in our case a 27-layer convolutional neural network); train the algorithm to identify the patterns we’re looking for; and test that algorithm on another large set of previously unseen images.

The machine learning framework we’ve chosen for the neural network is PyTorch, a Python-based scientific computing package which provides a deep learning platform using the power of GPUs. GPUs (graphics processing units) are particularly well suited to this sort of algorithm, and are attracting a lot of interest in machine learning circles. We trained the network on DeepLearn, a small in-house GPU server set up for running deep learning frameworks. This server contains a single NVIDIA Titan X (Maxwell) GPU and a single quad-core Intel Q6600 CPU. Training the network to classify images over 120 “learning epochs” took 8 hours.
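For readers unfamiliar with PyTorch, the training loop follows a standard pattern: define a network, push batches of images through it, and update the weights by backpropagation. The sketch below shows the shape of this, with a deliberately tiny two-layer network standing in for our 27-layer model; the layer sizes, learning rate and data loader are illustrative assumptions only.

```python
# Skeleton of a PyTorch training loop for per-pixel classification.
# TinyCNN is a stand-in for our 27-layer network; all hyperparameters
# here are illustrative.
import torch
import torch.nn as nn
import torch.optim as optim

class TinyCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, n_classes, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.features(x)  # per-pixel class scores

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinyCNN().to(device)
optimiser = optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def train(loader, epochs=120):
    for epoch in range(epochs):
        for images, labels in loader:              # labels: per-pixel class indices
            images, labels = images.to(device), labels.to(device)
            optimiser.zero_grad()
            loss = loss_fn(model(images), labels)  # compare scores with labels
            loss.backward()                        # backpropagate the error
            optimiser.step()                       # update the weights
```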

In terms of results, the network has a recall score of 83.78% (ie 83.78% of the pixels that belong to our objects of interest are correctly predicted as such), and a precision score of 84.54% (ie 84.54% of the predicted pixels of interest are genuine pixels of interest), which is encouraging.
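For concreteness, those scores are computed per pixel from the usual counts of true positives, false positives and false negatives. A sketch, assuming binary masks of predicted and ground-truth pixels of interest:

```python
# Pixel-wise precision and recall from binary masks (sketch).
import numpy as np

def precision_recall(predicted, truth):
    """predicted, truth: boolean arrays flagging pixels of interest."""
    tp = np.logical_and(predicted, truth).sum()   # correctly flagged pixels
    fp = np.logical_and(predicted, ~truth).sum()  # flagged but not genuine
    fn = np.logical_and(~predicted, truth).sum()  # genuine but missed
    precision = tp / (tp + fp)   # fraction of flagged pixels that are genuine
    recall = tp / (tp + fn)      # fraction of genuine pixels that were flagged
    return precision, recall
```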

We wanted to know how much better GPUs are than CPUs for this kind of problem. This led us to explore optimisations of PyTorch running on Cirrus, our Tier-2 HPC system.

PyTorch supports use of the multithreaded OpenBLAS linear algebra library, and each Cirrus compute node contains two 18-core Intel Xeon E5 (Broadwell) processors, with each core supporting two hardware hyperthreads. A Cirrus node therefore has up to 72 parallel hardware threads with which to compete against the Titan GPU. And compete it does: the 120 learning epochs that take 8 hours on DeepLearn take 10 hours on one Cirrus node.
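Getting that CPU performance does mean telling PyTorch and OpenBLAS how many threads to use, since the defaults do not necessarily match the hardware. Something along these lines, with the thread counts an assumption to be tuned per node rather than a recommendation:

```python
# Illustrative thread configuration for a CPU-only PyTorch run.
# The thread counts below are assumptions to benchmark, not recommendations.
import os
os.environ["OMP_NUM_THREADS"] = "36"        # OpenMP/OpenBLAS worker threads
os.environ["OPENBLAS_NUM_THREADS"] = "36"   # set before the libraries initialise

import torch
torch.set_num_threads(36)                   # threads used by PyTorch's CPU kernels
print(torch.get_num_threads())
```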

Slower, yes, but Cirrus has 280 nodes like this.

Our next step will be to experiment with the brand new MPI-like parallel version of PyTorch to try to scale our image classifier across the whole machine.
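An MPI-style, data-parallel distribution of training can be expressed with PyTorch’s torch.distributed package running over an MPI backend. The sketch below shows the general pattern we have in mind; the backend choice, launch command and gradient averaging are illustrative rather than a finished implementation.

```python
# Sketch of data-parallel training with torch.distributed over MPI.
# Launch with something like `mpirun -n <procs> python train.py`;
# requires a PyTorch build with MPI support (illustrative, not definitive).
import torch
import torch.distributed as dist
import torch.nn as nn

dist.init_process_group(backend="mpi")                  # one process per node
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU())   # stand-in for the real network

def average_gradients(model):
    """Average gradients across all processes after each backward pass."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= world_size

# In the training loop, call average_gradients(model) after loss.backward()
# and before optimiser.step(), so every process applies the same update.
```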

This is just a selection of our recent work. There are more interesting things going on at EPCC in the space of large-scale data engineering than we’ve room for here – benchmarking large graph databases, for instance. I'll save those for another post!