Edinburgh International Data Facility: an overview of Phase 1

Author: Rob Baxter
Posted: 22 Nov 2019 | 12:10

Developed by EPCC, the Edinburgh International Data Facility (EIDF) will facilitate new products, services, and research by bringing together regional, national and international datasets.

Launched at the end of 2018, the Data-Driven Innovation Programme (DDI) is one of six funded within the Edinburgh & South-East Scotland City Region Deal. The DDI programme has ambitious targets to support, enhance and improve talent, research, commercial adoption and entrepreneurship across the region, through better use of data. DDI targets ten industry sectors, with interactions managed through five DDI Hubs. The activities of these Hubs are underpinned by the EIDF.

EIDF will grow and mature with the DDI Programme, expanding in capacity and capability, responding to the needs of the innovation Hubs and, through them, to learners, researchers, innovators and entrepreneurs from across the region and beyond.

What is it?

You can think of the EIDF as a layer of storage and computing services presented as a private cloud and hosting a rich and growing collection of data. EIDF supports the long-term storage and curation of data assets, and their cataloguing, preparation and presentation as analytic-ready datasets for research and innovation. It offers a range of computing services, from web-based notebooks to rich desktop environments and seamless access to high-performance computing. 

EIDF supports learners, researchers and innovators across the spectrum, with services from basic data download through simple learn-as-you-play-with-data notebooks to full-throated, GPU-enabled machine learning platforms for driving AI application development.

EIDF also provides safe haven services to health and government users, following best practice in independent governance and supporting the linkage of complex personal data for public benefit research and policy-making under national and regional safeguards. Safe haven services can also be created for organisations wishing to host and govern access to their data assets in a highly secure environment. Safe havens are isolated from the rest of EIDF, with user approvals, data ingress and egress and permitted software all controlled by information governance bodies independent of the infrastructure itself.

Are we nearly there yet?

For this year and next, EIDF Phase 1 is all about development and co-design. Working with a number of key stakeholders and early adopters, we are putting together the core elements of EIDF, including a new home in EPCC’s Advanced Computing Facility (ACF) Computer Room 4, due for completion in June 2020. This is a high-resilience, state-of-the-art facility with enhanced power and network connectivity to provide the long-term durability needed for a data archive.

Meanwhile, the first hardware platform is in place in temporary accommodation elsewhere in the ACF. We already have 10 petabytes (PB) of new disk capacity, split 50/50 between the safe haven and non-safe haven sides of EIDF, with a planned increase to 12 PB by July 2020. We also have a small slice of the compute and service cloud in place: around 120 virtual machines and 20 NVIDIA V100 GPUs, plus access to our Cirrus HPC service.

This initial development system is helping us shape the building blocks for the future EIDF. We are laying down an architecture that will enable us to scale out storage or compute as needed, and putting together the supporting software layers for a variety of use cases from our early adopters.

Our partners in Phase 1 include the iCAIRD digital pathology archive, the National Collection of Aerial Photography, Health Data Research UK, the Administrative Data Research Partnership, the Paracrawl Internet Archive project, Albyn Housing Society, NHS Scotland, the Scottish Government, the Edinburgh Festival Fringe, SAS, the DataLoch and Local Authorities from across the region.

But what does it do?

By spring 2020 we expect the first EIDF services to come online. They will be headed by a data catalogue, the point of first contact for EIDF, incorporating an open metadata repository, easy access to open data and an approvals system supporting work with, and access to, restricted data. We also expect to launch browser-based “notebook” services using Jupyter and RStudio to support data analysis using Python and R.

Around the middle of 2020 we’ll launch our virtual machine services offering fully kitted-out desktops to support different kinds of data-orientated task: data analysis machines for statistical analysis with R, Python etc; a data science flavour for machine learning and data modelling (with or without GPU capability); and a data engineering variety for data flow software builders, with Spark, Scala, Kafka etc. We’ll also be rolling out browser-based access to the Bayes Centre’s state-of-the-art SAS Viya platform.

It’s an ambitious programme. The next twelve months will be busy – and yet this is only the first step along a very exciting road for EPCC, the University and the whole region.

Edinburgh International Data Facility website: www.ed.ac.uk/edinburgh-international-data-facility

Author

Rob Baxter, EIDF Programme Manager, EPCC

Image: your_photo/Getty Images