Data analytics is about deriving insight from data. Data engineering is about getting the right data to the right place at the right time in the right way, so that analysts can do what they do. Driven by the technologies of the big Web firms, there are now many software frameworks that allow users to get up and running quickly in data engineering. Scaling them efficiently and tailoring them to particular research or business needs, though, is still a challenge. EPCC's long experience of large-scale distributed systems gives us particular insight into the dos and don'ts of building efficient, effective data engineering systems.
Distributed data pipelines
Whether it uses Spark or Storm or a home-grown solution, and whether it draws data from Hadoop or Kafka, direct from disk or streamed off a sensor array, the data workflow is a fundamental architecture for data processing. In a distributed data workflow, each specialised component receives data, processes it and passes it on to the next component over simple network connections. From the early days of the UK e-Science Core Programme and the OGSA-DAI data streaming and distributed query engine, EPCC has been building distributed data workflow systems for over fifteen years. Our expertise lies in engineering for scale and efficiency, while avoiding bottlenecks, network deadlock and other pitfalls.
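The staged receive-process-pass-on pattern can be sketched in a few lines. This is an illustrative, in-process Python sketch using chained generators in place of the networked components a real system such as Spark or OGSA-DAI would use; all function names here are hypothetical.

```python
# A minimal sketch of a data workflow: each stage receives data,
# processes it, and passes it on to the next stage. In a distributed
# system each stage would be a separate process linked by a network
# connection; here the "connections" are Python generators.

def source(records):
    """First stage: emit raw records into the pipeline."""
    for record in records:
        yield record

def clean(stream):
    """Middle stage: receive each record, process it, pass it on."""
    for record in stream:
        yield record.strip().lower()

def sink(stream):
    """Final stage: collect results for the analysts downstream."""
    return list(stream)

result = sink(clean(source(["  Alpha ", "BETA", " gamma"])))
print(result)  # ['alpha', 'beta', 'gamma']
```

Because each stage pulls data lazily from the one before it, records flow through the pipeline one at a time rather than being materialised in full at every step, which is the same property that lets a distributed workflow avoid buffering bottlenecks.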
Case study: XDesign/Road Intelligence Ltd
Case study: GCRF-REAR
Machine learning
Machine learning is principally about creating algorithms that can detect patterns in data: not just one pattern one time, but general classes of patterns, like identifying faces or voices, spotting features in satellite photographs or clinical X-rays, or determining patterns of customer behaviour from volumes of transaction data. Developing the algorithms involves training a computational model (neural networks are a favourite) with a lot of appropriate data – a set of faces, for instance – so it can "learn" the key features of interest. While applying a machine-learned algorithm to new data is very quick – a handful of milliseconds – training can take days. Modern software frameworks like TensorFlow or Torch make it easy to get machine learning applications up and running, but the speed of training is still a major challenge. EPCC applies its expertise in parallel optimisation to push the performance of machine learning for a wide range of applications.
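The slow-training, fast-application pattern described above can be seen even in the smallest possible example. This is a deliberately tiny, framework-free sketch (plain Python, not TensorFlow or Torch): a one-parameter model fitted by gradient descent, where training loops over the data many times but applying the learned model is a single multiplication. All names are illustrative.

```python
# A minimal sketch of the train-then-apply pattern: the "training"
# phase iterates over the data repeatedly (slow), while "inference"
# with the learned parameter is a single operation (fast).

def train(samples, epochs=200, lr=0.05):
    """Learn a weight w so that the prediction w*x matches y."""
    w = 0.0
    for _ in range(epochs):            # many passes over the data
        for x, y in samples:
            grad = 2 * (w * x - y) * x  # gradient of squared error
            w -= lr * grad              # step downhill
    return w

def predict(w, x):
    """Applying the learned model: one multiplication."""
    return w * x

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x
w = train(data)
print(round(w, 2))   # converges to 2.0
print(predict(w, 5.0))
```

Real training runs differ only in scale: millions of parameters instead of one, and millions of samples instead of three, which is why training can take days while inference takes milliseconds, and why EPCC's parallel optimisation expertise matters for the training phase.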
Case study: Energy risk analysis