Research data infrastructure: where next?

Author: Rob Baxter
Posted: 30 Jul 2014 | 16:11

The rise of data-driven science, and with it the increasing digitisation of the research process, has spawned its own jargon and acronyms. “Research data infrastructure” is one such term, but what does it mean?

Loosely speaking (and often in IT there isn’t any other way), we can think of research data infrastructure (RDI) as a collection of hardware and software designed to capture, store and manage the enormous volumes of research data gathered by a particular group or community.

And, as research communities struggle with ever-increasing burdens of bits, we increasingly speak of “an infrastructure” (and therefore of several infrastructures, each one of many), because when we talk about research today, infrastructure is no longer as common and as underpinning as we would all like.

In simple terms, RDIs are about the archiving, sharing, description and discovery of research data for re-use – typically re-use within a discipline as part of the validation and verification of published scientific results. 
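To make “description and discovery” a little more concrete, here is a minimal, purely hypothetical metadata record of the kind an RDI might attach to a dataset. The field names and values are illustrative only, loosely modelled on common citation metadata rather than drawn from any particular standard.

    # A hypothetical, minimal metadata record supporting description and discovery.
    # All field names and values below are illustrative placeholders.
    dataset_record = {
        "identifier": "doi:10.xxxx/example",   # persistent identifier (placeholder)
        "title": "Sea-surface temperature observations, 2010-2013",
        "creators": ["A. Researcher", "B. Researcher"],
        "publisher": "Example Marine Data Centre",
        "publication_year": 2013,
        "subject": ["oceanography", "sea-surface temperature"],
        "licence": "CC-BY-4.0",
        "description": "Quality-controlled temperature readings from moored buoys.",
    }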

European research infrastructures

There are 35 RDI projects currently listed on the roadmap of the European Strategy Forum on Research Infrastructures (ESFRI). These infrastructure projects cover a wide range of research disciplines and each focuses on meeting the needs of one particular community.

Each of the ESFRI RDIs is a “vertical infrastructure”, and in a recent talk at University College London I described these as “RDI Phase I”. In Europe and globally, many efforts are now focusing on what we might term “Phase II”: connecting the vertical RDIs horizontally to create truly common, collaborative data infrastructure that cuts across disciplines. To achieve Phase II we need to harmonise a complex and dynamic set of factors, from how to describe data objects to how to handle user identities and legal licences. It is not easy. It’s such a challenge that we need to ask whether there’s any real value in doing it at all. Cross-discipline re-use of data? Really?

Research Data Alliance 

The Research Data Alliance (RDA) was formed 18 months ago by a group of data-oriented researchers keen to begin the research data harmonisation process. With funding from the European Union, the US National Science Foundation and the Australian National Data Service, the RDA has become the global forum for discussing better cross-disciplinary data sharing.

Collaborative infrastructure

In Europe, the EUDAT project is building the foundations of a truly cross-disciplinary collaborative data infrastructure. It has so far created a resilient network linking large-scale high-performance computing and data centres with individual community data repositories, providing a core set of data replication and management services. It aims to build, step by step, common pieces of infrastructure compatible with the existing investments of its ESFRI partners, weaving existing data repositories together in ways that are as non-disruptive as possible.

Next steps

Phase II of the development of RDIs is thus underway. What’s next?

Arguably Phase III involves the harmonisation not only of data infrastructures but also of computing infrastructure. We want to store and organise these data so other researchers can validate, re-analyse, combine and compute with them. 

As the volumes of collected research data increase, moving them around the Internet becomes less feasible, and bringing major RDIs together with significant HPC systems is the next obvious step to take. How best to achieve this strategy is still an open question. We need to balance the risks to data preservation with the need to provide access within a rich analytic environment. Does virtualisation solve every use-case? Are general-purpose computers suitable for data-intensive tasks? Do we need more data-set-specific “Data-Scopes”, built around particular archives?
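The first of those constraints, the impracticality of moving data, is easy to illustrate with a back-of-the-envelope sum. The sketch below assumes an idealised, uncontended link with no protocol overhead; the dataset sizes and link speeds are my own illustrative choices, and real transfers would be slower still.

    # Idealised transfer times for moving large datasets over the network.
    # Dataset sizes and link speeds are illustrative assumptions.
    def transfer_days(dataset_petabytes, link_gbit_per_s):
        bits = dataset_petabytes * 1e15 * 8        # petabytes -> bits
        seconds = bits / (link_gbit_per_s * 1e9)   # bits / (bits per second)
        return seconds / 86400                     # seconds -> days

    for size_pb in (1, 10, 100):
        for link_gbps in (10, 100):
            days = transfer_days(size_pb, link_gbps)
            print(f"{size_pb:>4} PB over {link_gbps:>3} Gbit/s: about {days:,.1f} days")

Even on a dedicated 100 Gbit/s link, a 100 PB archive would take the best part of three months to ship; bringing the computation to the data starts to look far more attractive.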

And what of Phase IV? If the first three phases of RDI development are concerned with better storage of research data, Phase IV may be about not storing it. 

Between 1998 and 2008, telescopes participating in the Sloan Digital Sky Survey collected 25 TB of data. When it comes online, the Large Synoptic Survey Telescope (LSST) will produce that in a week, and by 2019 the Square Kilometre Array telescope (SKA) will be producing 10 PB per year of “finished” data from a raw instrument feed of over 6 PB per second. From the current ESFRI roadmap, the EISCAT_3D polar geophysical imaging radar experiment will be generating 100 PB per year by 2020; that same year, the High Luminosity Large Hadron Collider, the upgrade to CERN’s LHC, will come online with estimated data rates of 500 PB annually.
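Taking the SKA figures above at face value, a quick calculation shows the scale of reduction implied between the raw instrument feed and the “finished” data; the arithmetic below is mine and purely illustrative.

    # The reduction implied by the SKA figures quoted above: a raw feed of
    # roughly 6 PB per second against roughly 10 PB per year of finished data.
    SECONDS_PER_YEAR = 365 * 24 * 3600            # ~3.2e7 seconds

    raw_pb_per_year = 6 * SECONDS_PER_YEAR        # ~1.9e8 PB of raw signal a year
    finished_pb_per_year = 10

    reduction = raw_pb_per_year / finished_pb_per_year
    print(f"Raw feed per year:  {raw_pb_per_year:.1e} PB")
    print(f"Finished per year:  {finished_pb_per_year} PB")
    print(f"Implied reduction:  roughly {reduction:.0e}x before anything is archived")

In other words, something like a ten-million-fold reduction has to happen before a single byte reaches an archive.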

Future directions

Can we store these data? We don’t know. Do we need to invent new ways to filter, process and analyse them as they stream off the detectors? Almost certainly. 
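What such in-stream processing might look like, in caricature, is a trigger that keeps only the rare, interesting samples and throws everything else away before it ever reaches storage. The feed, the threshold and the trigger condition below are entirely hypothetical; real instruments use far more sophisticated selection.

    # A toy illustration of in-stream reduction: keep only rare, high-amplitude
    # samples and discard the rest before storage. Everything here is hypothetical.
    import random

    def detector_feed(n_samples):
        """Stand-in for a raw instrument stream of noisy readings."""
        for _ in range(n_samples):
            yield random.gauss(0.0, 1.0)

    THRESHOLD = 4.0   # arbitrary trigger level

    kept = [s for s in detector_feed(1_000_000) if abs(s) > THRESHOLD]
    print(f"Stored {len(kept):,} of 1,000,000 samples "
          f"({100 * len(kept) / 1_000_000:.4f}% of the stream)")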

The future of research data infrastructures might well lie with neither high-capacity storage nor high-performance computing but with smarter instrument design.

Author

Rob Baxter, EPCC