HPC's role in Data Science
Posted: 12 Sep 2015 | 10:01
I was among those presenting at EPCC's recent Big Data seminars at Edinburgh BioQuarter and BioDundee. Both events provided a good opportunity for me to talk to people about their Big Data problems and their views on what Big Data means to them.
When I first read Adrian Jackson's article about Big Data and HPC, my initial response was that I was not sure I agreed with all of it (I’m less of a Big Data sceptic than Adrian!), but on reflection I’ve decided that I do agree with most of it.
The extent to which I agree, I think, boils down to my working definition of Big Data: it’s Big Data if it’s too big (in volume, velocity or variety) to handle using the tools and techniques we’ve been used to. It’s that last part, the tools and techniques we’re used to, that I think is key.
For the many people out there who are not familiar with HPC, it could very likely be the solution to their Big Data problem (it’s just that they’re not used to using it). For a much smaller but ever-growing number of problems, however, HPC alone is not going to be the solution. Anyone who’s ever tried to use HPC as part of a workflow knows that getting data into and out of a program running on an HPC machine is still far from trivial.
Whether a supercomputer is the solution to your Big Data problem will always depend on exactly what your Big Data bottleneck is. Is it that it’s too big to fit in memory? Or is it that you’re now working with data that’s so varied that the kind of data structures you’re having to create are difficult to process efficiently? New tools (whether that’s Hadoop, Spark, NoSQL databases or a workflow engine) all have their place in the toolbox for certain kinds of Big Data problems. The skill is being able to pick the right tool for the job, and the choice will often boil down to knowing where your existing techniques break down.
One thing I’m sure Adrian and I do agree on is that many of the ideas of parallelism remain the same, whether you're working on a supercomputer, a Spark cluster, or the four cores of your laptop processor.
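As a rough illustration of that last point, here is a minimal Python sketch of the split / map / reduce pattern. It is only a toy under stated assumptions: the function names (`split`, `partial_sum_of_squares`, `sum_of_squares`) and the `map_fn` parameter are made up for this example, and a thread pool stands in for "the cores of your laptop". The point is that the decomposition stays the same while the thing doing the mapping is swapped out.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add


def split(data, n_chunks):
    """Decompose the data into roughly equal contiguous chunks."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum_of_squares(chunk):
    """Work done independently on each chunk (the 'map' step)."""
    return sum(x * x for x in chunk)


def sum_of_squares(data, n_chunks, map_fn=map):
    """Split the data, map over the chunks with map_fn, then reduce.

    map_fn is a hypothetical hook for this sketch: anything with the
    same call shape as the built-in map() can be passed in.
    """
    partials = map_fn(partial_sum_of_squares, split(data, n_chunks))
    return reduce(add, partials, 0)  # the 'reduce' step


data = list(range(1_000))

# Serial: the built-in map runs the chunks one after another.
serial = sum_of_squares(data, n_chunks=4)

# Pooled: the same decomposition, handed to a pool of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    pooled = sum_of_squares(data, n_chunks=4, map_fn=pool.map)

print(serial, pooled)
```

In CPython, threads will not speed up CPU-bound work like this; `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface and is the usual choice there. On a Spark cluster or a supercomputer the machinery is different, but the questions are the same: how to split the data, what each worker does on its own, and how the partial results are combined.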