Building a scalable, extensible data infrastructure
Posted: 8 Jul 2016 | 14:48
Modern genome-sequencing technologies are easily capable of producing data volumes that can swamp a genetic researcher’s existing computing infrastructure. EPCC is working with the breeding company Aviagen to build a system that allows such researchers to scale up their data infrastructures to handle these increases in volume without compromising their analytical pipelines.
To achieve the desired scalability and reliability, the system uses a distributed columnar database where the data is replicated across a number of compute and data nodes. More compute nodes and storage can be easily added as the data volumes increase without affecting the analyses that operate on the data.
The analytics code had to be rewritten and parallelised to allow it to scale up as the volume of data increases. The new analytical pipelines operate on HDF5 extracts from the data store, with the data filtered at this stage to include only what is relevant to the subsequent calculation. HDF5 is a data model, library, and file format for storing and managing data. It is designed for flexible and efficient I/O and for high-volume, complex data.
The pipelines use Aviagen’s in-house queue management framework to exploit parallelism by distributing tasks across a set of available heterogeneous compute nodes. Using this parallel framework, we are implementing a bespoke task library that provides basic functionality (such as matrix multiplication) so that a researcher need only plug together the various analytical operations they require. The framework deals with managing the distribution of the parallel tasks, dependencies between tasks, and management of the distributed data.
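The idea of plugging library primitives into a dependency-managed framework can be illustrated with a toy scheduler. This is not Aviagen’s framework, just a minimal stand-in: tasks declare their dependencies, independent tasks run in parallel, and matrix multiplication plays the role of a library-provided primitive.

```python
from concurrent.futures import ThreadPoolExecutor

def matmul(a, b):
    """Plain-Python matrix multiply, standing in for a task-library primitive."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

class TaskGraph:
    """Toy scheduler: runs tasks in dependency order, parallelising
    independent ones on a thread pool."""

    def __init__(self):
        self.tasks = {}  # name -> (callable, list of dependency names)

    def add(self, name, fn, deps=()):
        self.tasks[name] = (fn, list(deps))

    def run(self, workers=4):
        results, pending = {}, dict(self.tasks)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            while pending:
                # All tasks whose dependencies are satisfied run concurrently.
                ready = [n for n, (_, deps) in pending.items()
                         if all(d in results for d in deps)]
                if not ready:
                    raise ValueError("cyclic or unsatisfiable dependencies")
                futures = {n: pool.submit(pending[n][0],
                                          *[results[d] for d in pending[n][1]])
                           for n in ready}
                for n, fut in futures.items():
                    results[n] = fut.result()
                    del pending[n]
        return results

# A researcher "plugs together" operations by naming tasks and dependencies.
g = TaskGraph()
g.add("A", lambda: [[1, 2], [3, 4]])
g.add("B", lambda: [[5, 6], [7, 8]])
g.add("AB", matmul, deps=("A", "B"))
product = g.run()["AB"]  # [[19, 22], [43, 50]]
```

Tasks "A" and "B" have no dependencies, so the scheduler launches them in the same batch; "AB" only runs once both results are available, mirroring how the real framework resolves inter-task dependencies before dispatching work to compute nodes.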
This system, combining the columnar database and the parallel analytics library, will allow data archiving and data processing in a scalable, extensible manner. Aviagen will be able to add more data analysis functionality as needed.
Andreas Kranis, Research Geneticist at Aviagen, says: “The collaboration with EPCC promises to give us the ability to handle increasingly large amounts of data.”
Amy Krause, EPCC
Eilidh Troup, EPCC