Spark-based genome analysis on Cray-Urika and Cirrus clusters

Author: Rosa Filgueira
Posted: 16 Jan 2019 | 11:06

Analysing genomics data is a complex and compute-intensive task, generally requiring numerous software tools and large reference data sets, tied together in successive stages of data transformation and visualisation.

Typically in a cancer genomics analysis, both a tumour sample and a “normal” sample from the same individual are first sequenced using next-generation sequencing (NGS) systems and then compared through a series of quality control stages. The first stage, ‘Sequence Quality Control’ (which is optional), checks sequence quality and performs some trimming. The second, ‘Alignment’, involves a number of steps, such as alignment, indexing, and recalibration, to ensure that the alignment files produced are of the highest quality, as well as several more to ensure that variants are called correctly. Both stages comprise a series of intermediate compute- and data-intensive steps that are very often handcrafted by researchers and/or analysts.

Analysing humanities data using Cray Urika-GX

Author: Rosa Filgueira
Posted: 11 Oct 2018 | 14:52

Over the last six months, in our role as members of the Research Engineering Group of the Alan Turing Institute, we have been working with Melissa Terras (University of Edinburgh's College of Arts, Humanities and Social Sciences, CAHSS) and Raquel Alegre (Research IT Services, University College London, UCL) to explore text analysis of humanities data. This work was funded by Scottish Enterprise as part of the Alan Turing Institute-Scottish Enterprise Data Engineering Programme.