Accelerating high performance data pipelines using EPCC's supercomputing facilities
Posted: 15 Jun 2021 | 12:26
Illuminate, a commercial collaborator of EPCC, has a mission to accelerate informed decision-making by giving its customers the means to find needle-in-a-haystack data points within the mass of their network traffic. As networks evolve, Illuminate continuously innovates to ensure that its data pipelines operate at maximum efficiency.
Data analytics pipelines have three main functions: data ingestion, processing, and analysis. The data ingestion stage focuses on obtaining or extracting data from a range of data sources and importing it into systems that can enable processing and analysis. This is followed by data processing to validate, transform and enrich it, and load it into some form of queryable data storage, such as a data lake. Finally, we utilise Artificial Intelligence/Machine Learning processes to build models from the data in the lake and use them to gain insight.
The data ingestion stage was the subject of a recent research project, in conjunction with EPCC and Intel, in which we investigated how network sensor data capture could be accelerated in a cloud environment. As we were concluding this work, it was announced that the Cirrus system had been upgraded to include 36 GPU nodes, each with four Nvidia V100 GPUs. As the data processing and loading stage is a prime candidate for GPU acceleration, we were keen to take the opportunity to run benchmarks on EPCC's Cirrus platform to investigate what opportunities these GPU resources provided for reducing processing and analysis runtimes, and thereby generating quicker insight for customers.
Public clouds offer considerable compute resources, so a logical question to ask is why Illuminate favoured a high-performance computing (HPC) facility for our investigation. Our experience is that cloud high-performance data pipelines are often I/O bound, and it takes considerable systems engineering effort to tune the network and storage middleware in cloud deployments to obtain good performance for such pipelines. On the other hand, an HPC facility offers a high-performance communications fabric coupled to a parallel filesystem that is already tuned to support the needs of the cluster for I/O-intensive applications. Lastly, an HPC service like Cirrus can provide the support required to get applications ported, running, and optimised on the system. The excellent documentation for Cirrus meant that only one support request was needed when installing and benchmarking our code.
Considerable value can be added to raw data in the processing stage by fusing it with reference data, and this exercise was the focus of our investigation. Take for example the IMPU (IMS Public User Identity), which is analysed by our roaming visibility solution. In 4G, your phone number is turned into an IMPU to route calls and text messages, and it takes the following form:
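The original post illustrates the IMPU format with an image that is not reproduced here. As a hedged sketch, the 3GPP IMS home network domain conventionally takes the form `ims.mnc<MNC>.mcc<MCC>.3gppnetwork.org`, so the MCC and MNC can be extracted with a regular expression; the example number and helper below are hypothetical:

```python
import re

# Illustrative only: the 3GPP IMS home network domain takes the form
# ims.mnc<MNC>.mcc<MCC>.3gppnetwork.org, so the MCC and MNC can be
# pulled out of an IMPU with a simple regular expression.
IMPU_RE = re.compile(
    r"sip:(?P<user>[^@]+)@ims\.mnc(?P<mnc>\d{2,3})\.mcc(?P<mcc>\d{3})"
    r"\.3gppnetwork\.org"
)

def parse_impu(impu: str):
    """Return (mcc, mnc) extracted from an IMPU, or None if it does not match."""
    m = IMPU_RE.match(impu)
    if m is None:
        return None
    return m.group("mcc"), m.group("mnc")

# Hypothetical IMPU carrying the MCC/MNC values used in the article's example
print(parse_impu("sip:+15551234567@ims.mnc150.mcc310.3gppnetwork.org"))
# → ('310', '150')
```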
The important parts in this data are the MCC (mobile country code) and MNC (mobile network code), which can be fused with reference data from other sources. In this hypothetical example, MNC 150 and MCC 310 are allocated to AT&T, which provides context for business applications such as revenue assurance that need to know the carrier associated with a subscriber. This enrichment needs to occur as the data is loaded, at rates of over 100K events/sec. We implemented an enrichment benchmark using the RAPIDS GPU-accelerated dataframe library, programmed through its Pandas-style Python API. The benchmark read data from a file, performed four joins against three different reference tables, and wrote the results to an output file.
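The enrichment step can be sketched as dataframe joins. The snippet below uses standard Pandas with hypothetical tables and column names (the real reference data and schema will differ); the RAPIDS cuDF library exposes a near-identical `merge` API, which is what allows this style of code to run on the GPU with minimal change:

```python
import pandas as pd

# Hypothetical event records, as produced by IMPU parsing upstream.
events = pd.DataFrame({
    "impu": ["sip:+15551234567@ims.mnc150.mcc310.3gppnetwork.org"],
    "mcc": ["310"],
    "mnc": ["150"],
})

# Hypothetical reference tables used to enrich the raw events.
carriers = pd.DataFrame({
    "mcc": ["310"], "mnc": ["150"], "carrier": ["AT&T"],
})
countries = pd.DataFrame({
    "mcc": ["310"], "country": ["United States"],
})

# Enrichment as left joins, so events with no reference match are kept.
enriched = (
    events
    .merge(carriers, on=["mcc", "mnc"], how="left")
    .merge(countries, on="mcc", how="left")
)
print(enriched[["impu", "carrier", "country"]])
```

With cuDF, replacing `import pandas as pd` with the cuDF equivalent moves the same joins onto the GPU, which is the property the benchmark exploited.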
To evaluate performance, we benchmarked the Cirrus V100 GPUs against some local CPU resources and against Nvidia T4 GPUs (through a Google Colab notebook). As the graph in this figure shows, the V100 GPU provides more than 10x the performance of CPU instances running on a local server, and double the performance of a T4 GPU on Colab.
The success of this initial benchmarking has prompted a wider investigation into the application of massively parallel acceleration of data analytics for Illuminate workloads. This work to accelerate high-performance data pipelines once again gives us the opportunity to use the unique supercomputing facilities at EPCC to ensure that our pipeline remains at the leading edge of data engineering.