Big Data: What's wrong with HPC?

Author: Adrian Jackson
Posted: 3 Sep 2015 | 17:55

Big Data vs HPC

Whilst I was writing my talk for last week's How to Make Big Data work for your business seminar at Edinburgh BioQuarter, it occurred to me that the way computational simulation codes have evolved over the last 40 years has really been a response to big data issues. 

There are many different definitions of big data, and many sceptics about whether it really lives up to the hype (I'm firmly in the sceptic camp, but that's maybe unsurprising given my HPC background).

To me, big data problems are those that have numerous and varied data entries (i.e. variety) rather than just large datasets; any decent parallel program can produce tens of terabytes of data at the drop of a hat if run on enough processors with enough output enabled, but I'm not sure that should be classed as a big data problem. The analysis required on such data also involves a lot of searching through it to pick out specific elements and process them. The other aspect that is generally true of big data problems is that the data has already been generated elsewhere, and the challenge is to get useful information out of pre-existing data.

Computational simulation generally does not process lots of pre-existing data, although codes may use datasets to set initial conditions or to validate results against; rather, it generates the data to be analysed on the fly. The core mathematical kernels of computational simulation codes simply take some initial data and a set of formulas and produce new data. The code then processes the data it has produced, generally condensing it to the essential data the user will need to analyse the results later, before moving on to the next iteration. Periodically that condensed data is written out to file, often along with some more detailed data that enables the simulation to restart if it crashes or runs out of time on the machine.
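To make that concrete, here is a minimal sketch of the loop structure I have in mind, written in C. Everything in it is an illustrative placeholder (the problem size, the three-point averaging that stands in for the real physics, and the output and checkpoint frequencies), not code from any particular simulation:

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1000000        /* illustrative problem size              */
#define NSTEPS 10000     /* illustrative number of timesteps       */
#define OUTPUT_FREQ 100  /* write condensed output every 100 steps */
#define CHKPT_FREQ 1000  /* write restart data every 1000 steps    */

int main(void)
{
    double *u_old = malloc(N * sizeof(double));
    double *u_new = malloc(N * sizeof(double));

    /* Initial conditions: generated here, or read from a (usually small) input dataset */
    for (int i = 0; i < N; i++) u_old[i] = 1.0;

    for (int step = 1; step <= NSTEPS; step++) {
        /* Core kernel: produce new data from the old data and a formula
           (a simple three-point average stands in for the real physics) */
        u_new[0] = u_old[0];
        u_new[N - 1] = u_old[N - 1];
        for (int i = 1; i < N - 1; i++)
            u_new[i] = (u_old[i - 1] + u_old[i] + u_old[i + 1]) / 3.0;

        /* Condense the freshly generated data to the essentials the user needs */
        double total = 0.0;
        for (int i = 0; i < N; i++) total += u_new[i];

        /* Periodically write out the condensed data... */
        if (step % OUTPUT_FREQ == 0)
            printf("step %d: mean = %f\n", step, total / N);

        /* ...and occasionally the more detailed data needed to restart */
        if (step % CHKPT_FREQ == 0) {
            FILE *f = fopen("checkpoint.dat", "wb");
            if (f) {
                fwrite(u_new, sizeof(double), N, f);
                fclose(f);
            }
        }

        /* Move on to the next iteration: new data becomes old data */
        double *tmp = u_old; u_old = u_new; u_new = tmp;
    }

    free(u_old);
    free(u_new);
    return 0;
}
```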

So fundamentally, at least to me, the difference between the big data problem and the HPC simulation is that the HPC simulation generates and analyses its data in situ, providing the minimum data the user needs to undertake their analysis and validation, whereas big data problems consume pre-existing data and perform the same kind of operations on it.

If you accept this analysis, then the question becomes: do we need specialised machines or software infrastructure, when we could tackle big data problems using standard parallel programs (e.g. MPI-based) that read in the data, process it, and write out the analysis?

Such a program would spend most of its time reading data in rather than performing the calculations required for a simulation, but it would still follow the standard distributed-memory parallel programming models and utilise the high-performance disk and network systems available on large-scale HPC systems.
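As a rough sketch of what such a program could look like, assuming the pre-existing data is simply a flat binary file of doubles called input.dat and the "analysis" is just a global mean (both assumptions are mine, purely for illustration):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of a "read in, process, write out" analysis program using
   standard MPI and MPI-IO, assuming the data is a flat file of doubles. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "input.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* Divide the file into one contiguous slab of doubles per process */
    MPI_Offset bytes;
    MPI_File_get_size(fh, &bytes);
    MPI_Offset total = bytes / sizeof(double);
    MPI_Offset chunk = (total + size - 1) / size;
    MPI_Offset start = rank * chunk;
    MPI_Offset count = (start >= total) ? 0 :
                       ((start + chunk > total) ? total - start : chunk);

    double *data = malloc(count * sizeof(double));
    MPI_File_read_at_all(fh, start * sizeof(double), data, (int)count,
                         MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    /* Process: a local sum stands in for the real analysis */
    double local = 0.0, global = 0.0;
    for (MPI_Offset i = 0; i < count; i++) local += data[i];
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Write out the condensed analysis result from one rank */
    if (rank == 0) printf("mean = %f\n", global / (double)total);

    free(data);
    MPI_Finalize();
    return 0;
}
```

Each process reads its own contiguous slab of the file with MPI-IO, does its share of the processing, and a reduction collects the condensed result: exactly the shape of a distributed-memory simulation code, with the compute kernel swapped for I/O and analysis.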

This approach also leverages the experience of 40 years of developing hardware and algorithms to address the processor-memory performance gap, which is another crucial issue for big data problems, just as it is for standard computational simulation. If we were still in the era where it was cheaper to load data into a processor than it was to perform instructions on that data, then big data problems would be much easier to tackle with computational hardware.

There are clearly problems that would still need specialist approaches, including those where it is impossible to fit all the data into the memory of an HPC machine. Even then, for many analysis problems it would be possible to stream the data in from disk in smaller chunks and undertake the analysis on those pieces, since most operations do not require the full dataset to be in memory.
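A serial sketch of that streaming pattern, again assuming a flat file of doubles and some illustrative running statistics (in practice each MPI process would stream its own portion of the file in the same way):

```c
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (1 << 20)  /* process 1M doubles (8 MB) at a time */

/* Sketch: analyse a file far larger than memory by streaming it in
   fixed-size chunks and keeping only running results in memory. */
int main(void)
{
    FILE *f = fopen("input.dat", "rb");
    if (!f) { perror("input.dat"); return 1; }

    double *buf = malloc(CHUNK * sizeof(double));
    double sum = 0.0, max = 0.0;
    long long n = 0;
    size_t got;

    while ((got = fread(buf, sizeof(double), CHUNK, f)) > 0) {
        for (size_t i = 0; i < got; i++) {
            sum += buf[i];
            if (n == 0 || buf[i] > max) max = buf[i];
            n++;
        }
    }

    printf("n = %lld, mean = %f, max = %f\n", n, n ? sum / n : 0.0, max);

    free(buf);
    fclose(f);
    return 0;
}
```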

Another set of problems that would also warrant specialised software or machines are those where a single dataset will be analysed many times, either because users are looking for different answers in the data or to enable the use of different analysis techniques. In this case, loading the data into a format and distribution that supports recurrent analysis, i.e. onto the nodes of a Hadoop cluster or into a NoSQL database, would enable such investigations to be undertaken efficiently, albeit for the initial cost of the data setup and formatting.

However, for many other problems, where the issue is simply that a large dataset needs to be analysed and this is not easy to do with standard tools (e.g. Excel, gnuplot, R, etc.), it may be more straightforward to analyse it using a program that looks more like a parallel computational simulation code than a Hadoop program or a specialised database. Whilst that may not sound like a big difference for users and scientists, using HPC approaches provides portability of programs and mature hardware and software stacks, which can be very useful in enabling access to the widest range of computational resources.

Another approach would be to use a data analysis tool that provides a simple user interface but runs on standard HPC resources. One such tool that EPCC has been involved in developing for a number of years is SPRINT, which provides parallel functionality for the statistical programming language R.

One area where the big data field is providing valuable functionality that the HPC community should also offer is full-lifecycle support for researchers.

Big data is not limited to data generation and analysis (which is how we could categorise computational simulation). It also focuses on support for importing and managing data sets: providing tools to move and format data as required by the analysis, undertaking large-scale visualisation, and storing and curating data after the analysis has been performed.

Data import and export tools, storage, and visualisation are becoming increasingly important and time-consuming activities for computational scientists, a fact that people like me can forget when focusing on the parallelisation and optimisation of code (although high-performance visualisation has long been an area of research and is now routinely used on services like ARCHER through software such as VTK, VisIt and ParaView).

Therefore, if the big data field can provide functional tools to help scientists with these issues, and also support those applications that really cannot use current HPC hardware for data analysis (think SKA, or Google's PageRank), then the users of HPC, and computational scientists in general, will be better off.

Author

Adrian Jackson, EPCC