Developing an outbreak analysis platform
Posted: 27 Jul 2021 | 11:49
EPCC is working with ISARIC4C, the Coronavirus Clinical Characterisation Consortium, to develop an integrated analysis platform that will be hosted in part on the Edinburgh International Data Facility.
The data analysis platform provides a unique combination of linked, curated data from UK sovereign data assets, together with a flexible high-performance compute space. Created for COVID-19 research, the ISARIC4C data analysis platform combines the data safeguards of an NHS trusted research environment with more than £100M of new exabyte-scale computational capacity at the home of the UK national supercomputer. This makes it possible to combine clinical, biological, genomics and virology research in a secure, openly accessible framework.
ISARIC4C is the world’s largest observational study of hospitalised patients with COVID-19.
By generating, integrating and analysing clinical, biological, genetic and virological data on patients with COVID-19 in UK hospitals, ISARIC4C has provided vital information for policy-makers and health providers, including weekly updates to SAGE that guide the public health response. Specific areas informed include vaccine effectiveness and the choice of therapeutic agents for clinical trials.
The outbreak analysis platform was developed by ISARIC4C to encourage and facilitate research by collating, linking and curating clinical and research data, enabling deep integrative analyses of multi-omic disease profiling, stratified by viral variant, clinical phenotype and outcome.
This platform now serves as a hub for a coordinated UK national research response to COVID-19 and hosts data from several sources. The ISARIC research data within the analysis platform is already linked to NHS Scotland secondary care and death records. Linkage to NHS England data is currently being incorporated, along with other research data including COG-UK variant data, GenOMICC genome sequence data and UK-CIC phenotype data. Future plans include primary care, immunisation and ONS data.
Analysis platform structure
There are two routes of access to the analysis platform:
1. NHS Trusted Research Environment (Safe Haven) for access to personal clinical data and data collected without explicit consent.
2. Rapid-access flexible compute for access to non-disclosive research data collected with explicit consent.
Within both these environments there is an additional division in the data:
1. Publishable “open access” data which any user can use and report as they wish, according to data protection and privacy rules.
2. Embargoed active research data, shared by academic investigators and available for linked analysis but not for publication without agreement from all contributors.
This design is intended to build trust in order to encourage immediate contributions of research data from academic collaborators.
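To make the two access routes and the two data divisions concrete, here is a minimal Python sketch of how those rules could be expressed. It is an illustration only, not the platform's implementation: the names `Route`, `Tier`, `Dataset`, `route_for` and `may_publish` are all hypothetical, and the rules are simply a reading of the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    SAFE_HAVEN = "NHS Trusted Research Environment (Safe Haven)"
    RAPID_ACCESS = "Rapid-access flexible compute"


class Tier(Enum):
    OPEN = "publishable open access"
    EMBARGOED = "embargoed active research"


@dataclass(frozen=True)
class Dataset:
    # All fields are hypothetical labels used only for this sketch.
    name: str
    personal_clinical: bool  # contains personal clinical data
    explicit_consent: bool   # collected with explicit consent
    tier: Tier


def route_for(ds: Dataset) -> Route:
    """Pick the access route implied by the description in the text."""
    # Personal clinical data, or data collected without explicit consent,
    # is only reachable through the Safe Haven.
    if ds.personal_clinical or not ds.explicit_consent:
        return Route.SAFE_HAVEN
    return Route.RAPID_ACCESS


def may_publish(ds: Dataset, all_contributors_agree: bool) -> bool:
    """Open data may be reported (subject to data protection and privacy
    rules, which are not modelled here); embargoed data needs agreement
    from every contributor."""
    if ds.tier is Tier.OPEN:
        return True
    return all_contributors_agree
```

For example, a consented, non-disclosive research dataset under embargo would be routed to the rapid-access compute space, but `may_publish` would return `False` until all contributors agree.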
Rapid addition of viral sequence data from the COG-UK platform will enable real-time detection of the clinical impact of new viral strains, as well as in-depth biological study of reinfection and of host-pathogen interactions at a genetic and mechanistic level.
EPCC hosts the ISARIC4C and linked data inside the National Safe Haven, alongside the Scottish COVID-19 research database. The latter holds NHS Scotland datasets as well as ISARIC4C data and is used to extract linked data for researchers interested in COVID-19-related data on Scottish patients.
The main ISARIC4C database includes extracts from the Scottish database and extracts of NHS England datasets for English subjects enrolled in the ISARIC4C studies, received under approvals from NHS Digital. We also plan to include other research study data involving COVID-19 patients, notably the PHOSP and GenOMICC studies. Work is ongoing to ingest all these datasets with suitable anonymised identifiers that allow them to be linked to the ISARIC4C study data.
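As an illustration of how anonymised identifiers can still allow linkage, the sketch below uses a keyed hash (HMAC-SHA256) to turn an identifier into a stable pseudonym and then joins two small tables on it. This is a common general technique and an assumption on our part for the purposes of the example: the article does not describe how the platform's identifiers are actually generated. The function names, the `pid` field and the sample values are hypothetical.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Map an identifier to a stable pseudonym using a keyed hash.

    The same identifier and key always give the same pseudonym, so
    datasets processed with the same key can be linked without exposing
    the original identifier.
    """
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def link(left, right, key="pid"):
    """Inner-join two lists of dicts on a shared pseudonymous key."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


# Hypothetical example data: a study table and a hospital-records table.
key = b"example-key-held-by-the-data-controller"
study = [{"pid": pseudonymise("subject-001", key), "variant": "B.1.1.7"}]
records = [{"pid": pseudonymise("subject-001", key), "admitted": "2021-01-05"}]
linked = link(study, records)
```

The design point is that only the party holding the key can regenerate a pseudonym from an identifier, while analysts working with the linked tables see only the pseudonyms.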
The ISARIC4C database is large, with over 2 million rows of 902 variables, and on average 11 rows of data for each patient. Cleaning scripts are also being incorporated into the platform; these create additional outputs of aggregated and summary data and update some outlying values.
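The figures above imply roughly 2,000,000 / 11, or about 180,000, patients. The short Python sketch below shows that arithmetic alongside one simple way a cleaning step might collapse several rows per patient into a summary row. It is not the project's cleaning code: the field names, the plausible range, and the choice to exclude out-of-range values (rather than update them, as the real scripts do for some values) are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarise_by_patient(rows, value_field, plausible_range):
    """Aggregate multiple rows per patient into one summary per patient.

    Values outside plausible_range are treated as outliers and left out
    of the mean (an assumption for this sketch).
    """
    low, high = plausible_range
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["patient_id"]].append(row)

    summary = {}
    for pid, patient_rows in grouped.items():
        values = [
            r[value_field]
            for r in patient_rows
            if r.get(value_field) is not None and low <= r[value_field] <= high
        ]
        summary[pid] = {
            "n_rows": len(patient_rows),
            "n_valid": len(values),
            "mean": mean(values) if values else None,
        }
    return summary


# Back-of-envelope patient count from the figures quoted in the text.
approx_patients = round(2_000_000 / 11)
```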
Researchers requiring extracts of the data apply via eDRIS (part of Public Health Scotland), which oversees completion of the appropriate training to use the Safe Haven and the data agreements with the ISARIC4C management committee. The researchers, the nominated eDRIS Research Coordinator and the EPCC Applications Developer then discuss exactly which data the researchers need and are permitted to have. The EPCC developer creates the extracts, runs checks and transfers the extract data with accompanying documentation to the project area, where the Research Coordinator runs further checks before releasing the data to the researchers. So far we have provided some ISARIC4C-only data extracts and some linked to Scottish NHS data and variant-of-concern data. Preparations are under way to link to English NHS and other study data soon.
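The extract process described above is a fixed sequence of steps, which can be pictured as a small state machine. The Python sketch below is only a way of visualising that sequence; the stage names are our own hypothetical labels, not terms used by eDRIS or EPCC, and no real tooling is implied.

```python
from enum import Enum, auto


class Stage(Enum):
    APPLIED = auto()              # application made via eDRIS
    SCOPED = auto()               # data needs and permissions agreed
    EXTRACTED = auto()            # EPCC developer creates the extract
    DEVELOPER_CHECKED = auto()    # developer runs checks
    TRANSFERRED = auto()          # extract and documentation moved to project area
    COORDINATOR_CHECKED = auto()  # Research Coordinator runs further checks
    RELEASED = auto()             # data released to researchers


# Enum members iterate in definition order, which is the workflow order.
ORDER = list(Stage)


def advance(stage: Stage) -> Stage:
    """Return the next stage; stages cannot be skipped or repeated."""
    position = ORDER.index(stage)
    if position == len(ORDER) - 1:
        raise ValueError("extract has already been released")
    return ORDER[position + 1]
```

Modelling the steps this way makes the two independent checks explicit: an extract cannot reach `RELEASED` without passing through both `DEVELOPER_CHECKED` and `COORDINATOR_CHECKED`.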
One of our largest challenges over the summer will be the ingest of up to 2 PB of genomic data, both viral and host, from the GenOMICC and COG-UK studies. These datasets will be hosted in a secured area of the new Edinburgh International Data Facility (EIDF) and will be supported by significant computational capability from one of EIDF’s new HPE Superdome Flex large-memory systems.
Getting all the data into the Safe Haven and wider analysis platform has been an ongoing challenge, with many stakeholders involved, but this is an exciting project with huge potential for increasing collaboration with researchers. There will be enormous scope for linking different data of interest to a great variety of specialists.