Modern genome-sequencing technologies can easily produce data volumes that swamp a genetic researcher’s existing computing infrastructure. EPCC worked with the breeding company Aviagen to build a system that allows such researchers to scale up their data infrastructure to handle these increases in volume without compromising their analytical pipelines.

To achieve the desired scalability and reliability, the system used a distributed columnar database in which the data was replicated across a number of compute and data nodes. More compute nodes and storage could be added as data volumes grew, without affecting the analyses that operated on the data.
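The replication idea above can be illustrated with a minimal sketch. It does not describe the placement scheme of the database used in the project, which the text does not specify. It uses rendezvous (highest-random-weight) hashing as a stand-in technique, and the function name `replicas_for`, the node names, and the replication factor of 2 are all hypothetical.

```python
import hashlib


def replicas_for(chunk_id, nodes, replication=2):
    """Return the nodes that hold copies of a data chunk.

    Illustrative only: rendezvous hashing is an assumed placement
    scheme, not the one used by the project's database.
    """
    def score(node):
        # Each (node, chunk) pair gets a deterministic pseudo-random score.
        digest = hashlib.sha256(f"{node}:{chunk_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The highest-scoring nodes store the replicas of this chunk.
    return sorted(nodes, key=score, reverse=True)[:replication]


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
    print(replicas_for("genotypes/chunk-0", nodes))
    # Adding a node moves a chunk only if the new node wins a replica slot.
    print(replicas_for("genotypes/chunk-0", nodes + ["node-d"]))
```

With this kind of scheme, adding a node only moves the chunks that the new node takes over, so existing replicas mostly stay where they are. That is one way a store can grow without disturbing the analyses that read from it.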

The analytics code was rewritten and parallelised to allow it to scale up as the volume of data increased. The new analytical pipelines operated on HDF5 extracts from the data store, with the data filtered at this stage to include only the data relevant to the subsequent calculation. HDF5 is a data model, library, and file format for storing and managing data. It is designed for flexible and efficient I/O and for high-volume, complex data.

The pipelines used Aviagen’s in-house queue management framework to exploit parallelism by distributing tasks across a set of available heterogeneous compute nodes. Using this parallel framework, we implemented a bespoke task library that provided basic functionality (such as matrix multiplication), so that a researcher need only plug together the analytical operations they require. The framework managed the distribution of the parallel tasks, the dependencies between tasks, and the distributed data.
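As a rough illustration of plugging tasks together with dependencies, the sketch below wires two basic operations into a small pipeline. It is not Aviagen's framework or the project's task library, whose interfaces are not described here. A local thread pool stands in for the queue of compute nodes, and the names `run_pipeline`, `matmul`, and `transpose` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # Python 3.9+


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def run_pipeline(tasks, inputs, max_workers=4):
    """Run tasks in dependency order; tasks that are ready together run concurrently.

    tasks:  name -> (function, [names of the arguments it consumes])
    inputs: name -> value, for data that no task produces
    """
    results = dict(inputs)
    # Only dependencies produced by other tasks constrain the ordering.
    graph = {name: [d for d in deps if d in tasks] for name, (_, deps) in tasks.items()}
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose dependencies are already satisfied.
            futures = {
                name: pool.submit(tasks[name][0], *(results[d] for d in tasks[name][1]))
                for name in sorter.get_ready()
            }
            for name, future in futures.items():
                results[name] = future.result()
                sorter.done(name)
    return results


if __name__ == "__main__":
    X = [[1, 2], [3, 4]]
    pipeline = {
        "Xt": (transpose, ["X"]),
        "XtX": (matmul, ["Xt", "X"]),
    }
    print(run_pipeline(pipeline, {"X": X})["XtX"])  # prints [[10, 14], [14, 20]]
```

The researcher-facing part is the `pipeline` dictionary: it names the operations and what each one consumes, while scheduling and ordering are left to the runner. A real deployment would replace the thread pool with a queue that dispatches work to remote nodes.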

This system, combining the columnar database and the parallel analytics library, allowed data archiving and processing in a scalable, extensible manner. Aviagen was able to add more data analysis functionality as needed.