HPC's role in Data Science

Author: Adam Carter
Posted: 12 Sep 2015 | 10:01

I was among those presenting at EPCC's recent Big Data seminars at Edinburgh BioQuarter and BioDundee. Both events provided a good opportunity for me to talk to people about their Big Data problems and their views on what Big Data means to them.

When I first read Adrian Jackson's article about Big Data and HPC, my initial response was that I wasn't sure I agreed with it all (I'm less of a Big Data sceptic than Adrian!), but on reflection, I've decided that I do agree with most of it.

The extent to which I agree, I think, boils down to my working definition of Big Data: it’s Big Data if it's too big (in volume/velocity/variety) to handle using the tools and techniques we’ve been used to. It’s this last point, I think, that’s key.

For the many people out there who are not familiar with HPC, it could well be the solution to their Big Data problem (it's just that they're not used to using it). For a much smaller, but ever-growing, number of problems, however, HPC alone is not going to be the solution. Anyone who has ever tried to use HPC as part of a workflow knows that getting data into and out of a program running on an HPC machine is still far from trivial.

Whether a supercomputer is the solution to your Big Data problem will always depend on exactly what your Big Data bottleneck is. Is it that your data is too big to fit in memory? Or is it that you're now working with data so varied that the data structures you have to create are difficult to process efficiently? New tools (whether Hadoop, Spark, NoSQL databases or a workflow engine) all have their place in the toolbox for certain kinds of Big Data problems. The skill is in picking the right tool for the job, and the choice will often boil down to knowing where your existing techniques break down.

One thing I’m sure Adrian and I do agree on is that many of the ideas of parallelism remain the same, whether you're working on a supercomputer, a Spark cluster, or the four cores of your laptop processor.
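To make that last point concrete, here is a minimal sketch (my own illustration, not code from the post) of the same decompose-the-data, work-in-parallel, combine-the-results pattern on laptop cores, using Python's standard multiprocessing module. Conceptually the same decomposition underlies an MPI code on a supercomputer or a map/reduce job on a Spark cluster.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """Work done independently on each partition of the data."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4  # e.g. the four cores of a laptop processor
    # Decompose: split the data into roughly equal partitions, one per worker.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        # Map: each core processes its own partition in parallel.
        partials = pool.map(partial_sum, chunks)
    # Reduce: combine the per-partition results into the final answer.
    print(sum(partials))
```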

Author

Adam Carter, EPCC 

Comments

Thanks for the great response, Adam. I'm glad that someone else from EPCC who is working more on the data side (or should that be the dark side...) has had a chance to put forward a different point of view.

I very much agree with your sentiments. I think what is of paramount importance going forward is the development of tools that make using large-scale compute much simpler for end users. Indeed, I think that "full lifecycle management" is what we should be focussing on, rather than just the technically interesting data or HPC parts.

Having said that, I'm still worried that this trend for pursuing HPC and Big Data as two separate, distinct, disparate subjects could restrict the usage and usefulness of the tools that are developed to enable end users to easily exploit this technology.

We probably need some proper interoperability standards for data formats, workflows, filesystems, databases, etc. However, the danger is always that thousands of hours are spent on standards while, in the meantime, people have been getting on and solving the problem in ad hoc ways...

I guess there are no easy solutions here. I just hope the two disciplines don't become separate and segmented, but work together and end up producing useful tools and user environments across the board.