Big data, big compute
Posted: 17 Dec 2015 | 10:46
Gathering data for their own sake is pointless without the analysis techniques and computing power to process them. But what techniques, and what kind of computing power? The differing nature of data-driven analysis in different fields and areas of application demands a variety of computational approaches.
Through our current activities in big data at EPCC, we’re aiming to understand how big data and big compute can be brought together to create, well, big value.
Big data analytics
Big data analytics today typically means the application of statistical mathematics to large, complex datasets in order to uncover hidden patterns and provide useful insights. It is a response to the big data explosion; statistical approaches are the only feasible methods for many data-driven research questions, and many of these may demand serious computing power.
The adoption of big data analytics techniques by business is a relatively new phenomenon, but the application of computational, statistical and probabilistic techniques to extract bulk properties (averages, states, rules, phase transitions) from complex systems can be traced back to early 20th-century statistical mechanics. Science has been doing data analytics for a long time.
Volume, variety, velocity
The use of the “three Vs” to characterise “big data” provides a useful starting point for classifying digital data and the types of computation that can be brought to bear upon them.
The volume of a dataset can be measured in two ways – by the size of the individual digital objects (file size in bytes, for instance), and by the number of digital objects in the dataset. These are not mutually exclusive: a dataset may comprise a large number of very large digital objects, and both measures need to be factored in to the data’s underlying “computability”. And the numbers are staggering. Current estimates suggest that 40 zettabytes – some 43 trillion gigabytes – will be created globally by 2020.
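As a rough illustration of how the two measures of volume combine, the sketch below multiplies object count by mean object size and converts between byte units. It assumes decimal (SI) prefixes, and the names `UNITS`, `convert` and `dataset_volume_bytes` are invented for this example; binary prefixes would give slightly different figures.

```python
# Decimal (SI) byte units. This is an assumption: binary units (GiB, TiB, ...)
# would give slightly different numbers.
UNITS = {"B": 1, "GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}


def convert(value, from_unit, to_unit):
    """Convert a quantity of data from one byte unit to another."""
    return value * UNITS[from_unit] / UNITS[to_unit]


def dataset_volume_bytes(object_count, mean_object_bytes):
    """Total volume is the number of objects times their mean size."""
    return object_count * mean_object_bytes


# Hypothetical dataset: two million objects of 5 GB each.
total_bytes = dataset_volume_bytes(2_000_000, 5 * UNITS["GB"])
total_pb = convert(total_bytes, "B", "PB")

# 40 zettabytes expressed in terabytes and gigabytes (decimal units).
forty_zb_in_tb = convert(40, "ZB", "TB")
forty_zb_in_gb = convert(40, "ZB", "GB")
```

Here the hypothetical dataset comes to 10 petabytes, and 40 zettabytes works out at 40 billion terabytes in decimal units.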
Variety in a dataset is something of a misnomer; it refers properly to the complexity of analysis required for individual digital objects. In this sense, scientific data tends to be low in variety – simulations and sensors produce numbers encoded in well-specified ways – although the increasing quantity of science data recorded as digital images introduces the need for more statistical techniques in terms of classification theory, fuzzy matching and so on. Additionally, complexity can very quickly mount up once we start to look at combining data from multiple heterogeneous and potentially distributed sources.
The velocity of a dataset is the speed with which it changes; the live customer transaction database in a busy supermarket, for instance, is a high-velocity dataset. Considering the dataset alone, there is a trade-off between velocity and volume: a high-velocity dataset can be recorded as a static one of large volume (a transaction log, an initial state and a sequence of differences, or a time series of individual records).
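The idea of recording a changing dataset as an initial state plus a sequence of differences can be sketched in a few lines. This is a toy example only: the stock figures, the `(item, delta)` record layout and the `replay` function are assumptions made for illustration, not a description of any real transaction system.

```python
def replay(initial_state, differences):
    """Rebuild the current state from an initial state and an ordered
    sequence of (key, delta) differences."""
    state = dict(initial_state)  # copy, so the recorded initial state is untouched
    for key, delta in differences:
        state[key] = state.get(key, 0) + delta
    return state


# Hypothetical supermarket stock levels at the start of the day.
initial_stock = {"milk": 100, "bread": 50}

# The "high-velocity" part, stored as a static, append-only log.
transaction_log = [("milk", -2), ("bread", -1), ("milk", -1), ("eggs", 30)]

current_stock = replay(initial_stock, transaction_log)
print(current_stock)  # {'milk': 97, 'bread': 49, 'eggs': 30}
```

The log grows with every transaction, which is the trade-off in action: the velocity of the live database has been exchanged for the volume of the static record.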
One aspect of time which can be critical to the recording of a dataset (as opposed to its analysis) is whether the data record an observation or just a measurement. An observation is time-critical – e.g. a supernova explosion – whereas a measurement is, in principle, reproducible. Simulation data, for instance, are always reproducible; gene sequences are reproducible from the same strands of DNA. In some cases, a measurement which is in principle reproducible may be practically irreproducible – experimentally crashing a very large crude carrier into a pier, for instance. Such data can be treated as observations.
Time criticality in big data computing
Whether a dataset comprises observations or measurements speaks more to the care required to preserve it than to any need to treat it differently in computational terms. The velocity of the dataset becomes important when compared with the time taken to analyse one of its component objects, and with the degree of time criticality of the compute process itself.
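One simple way to make the comparison between velocity and per-object analysis time concrete is a back-of-the-envelope capacity estimate. The sketch below assumes a steady arrival rate, a fixed analysis time per object and perfectly parallel workers; these are simplifying assumptions, and the function names are invented for this example.

```python
import math


def workers_needed(arrival_rate_per_s, seconds_per_object):
    """Minimum number of parallel workers needed to keep pace with arrivals."""
    return math.ceil(arrival_rate_per_s * seconds_per_object)


def backlog_after(arrival_rate_per_s, seconds_per_object, workers, elapsed_s):
    """Number of unanalysed objects accumulated after elapsed_s seconds."""
    service_rate = workers / seconds_per_object  # objects analysed per second
    return max(0.0, (arrival_rate_per_s - service_rate) * elapsed_s)


# Hypothetical figures: 50 objects arrive per second, each takes 0.25 s to analyse.
needed = workers_needed(50, 0.25)
backlog_with_10 = backlog_after(50, 0.25, workers=10, elapsed_s=60)
backlog_with_13 = backlog_after(50, 0.25, workers=13, elapsed_s=60)
```

With these figures 13 workers are needed; with only 10, a backlog of 600 objects builds up in the first minute. Whether that backlog matters depends on the time criticality of the analysis.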
In the context of computing with big data, time criticality can be thought of as the answer to the question “Do we have to analyse these data now, or can they wait until tomorrow?” If the answer is “tomorrow” (for a definition of “tomorrow” which is case-dependent, of course), the computing process is probably not time-critical.
Time criticality in computing is not the same as data velocity, nor necessarily the same as real-time analytics. It is more a measure of the timescale on which the insights provided by the data-crunching process continue to be relevant, useful and valuable (another “V”): a weather forecast for yesterday has little value.
Much scientific data analysis is not time-critical, unlike in business where insights from data may have a useful life measured in months or weeks. Time-critical analytics are characteristic of decision-support systems.
Time criticality can have a significant impact on the design of the whole approach to data analysis. A sub-second decision timeframe demands enormously high throughput and blazingly fast algorithms, perhaps running on field-programmable gate arrays or other reconfigurable hardware. High-volume, high-velocity data may need to be processed and reduced on a timescale dictated by the available storage capacity. Algorithms may not be able to “wait” for a complete dataset to arrive, and must instead begin to work incrementally with data as soon as they become available.
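To illustrate what working incrementally can look like, here is a minimal sketch of a running mean and variance that is updated one value at a time, using Welford's online algorithm. It is one well-known example of an incremental method, chosen for illustration rather than taken from any EPCC project; the class name `RunningStats` and the sample values are assumptions.

```python
class RunningStats:
    """Running mean and sample variance, updated one value at a time
    (Welford's online algorithm). No past values are stored."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance; defined as 0.0 until two values have been seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:  # stand-in for values arriving over time
    stats.update(reading)
    # stats.mean and stats.variance are usable here, before the stream ends
```

Because each update needs only the previous summary and the new value, the raw data can be discarded or reduced as soon as they have been processed, which matters when storage capacity sets the timescale.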
The key to applying big computing to big data, then, lies in understanding not only the nature of the question being asked but also the nature of the underlying data – how big, how many, how complex, how time-critical. Here the three Vs can provide a useful means of classifying problems: big data, big compute, or both at once?
Big data at EPCC
Our big data projects address the problems of processing, storing and analysing raw data. Current examples include:
- AstroData: Astronomical data
- Aviagen: Genomic datasets
- EUDAT: European Data Infrastructure
- Farr Institute of Health Informatics Research: Building a new secure, data-driven computational platform for health research
- ICORDI: International Collaboration on Data Infrastructure
- Next Generation Sequencing: Collaborating with Edinburgh Genomics to provide the infrastructure and support to optimise high-throughput gene sequencing
- PERICLES: Fighting ‘semantic decay’
Rob Baxter, EPCC