Data infrastructure: highlights of the EUDAT Conference 2013

Author: Rob Baxter
Posted: 13 Nov 2013 | 14:27

EUDAT - the European Data Infrastructure project - has reached the end of its second year and has, with some success, distilled the first version of a common, collaborative, horizontal data infrastructure from among the vertical stacks of its various partners.

EUDAT's task - to construct the first major cross-discipline research data infrastructure in the EU - is not quite the same as changing the engines of an aeroplane while it's in flight, but it shares aspects of the same challenge. In late October, over 200 data scientists gathered in a pleasantly, if a little unseasonably, mild Rome for the Second EUDAT Conference.

The Conference is partly a showcase for EUDAT and partly a round-up of other major activities on the bleeding edge of research data management. It began in highly parallel fashion on the Monday with a training programme covering some of the core technologies used in EUDAT's emerging collaborative data infrastructure - iRODS, GridFTP, Handles, Invenio - and a series of associated workshops on data management in the humanities, social and environmental sciences.

Open is better

The Conference proper opened on Tuesday with an excellent motivating keynote from Richard Frackowiak, Director of the Department of Clinical Neuroscience at CHUV University Hospital, Lausanne. He introduced the major new Human Brain Project and made a special plea for domain scientists and informaticians to work together. Our grand research challenges, like understanding the brain, are fundamentally cross-discipline these days, and data and computing infrastructure are an indelible part of the picture.

This joined-up thinking in research infrastructure was echoed very clearly by Kostas Glinos, Head of the eInfrastructure Unit in DG-CONNECT at the European Commission. He described the upcoming Framework Programme 8 - Horizon 2020 as it's known - and the need for transverse, cross-disciplinary approaches. He also stressed openness: openness in science, openness in access. "Open is better," he noted.

The opening session was rounded off by a welcome from Kimmo Koski, Managing Director of CSC Finland and Coordinator of EUDAT, with observations on the global scale of modern science, and the need for infrastructure to match. He also noted the presence of a good number of his fellow Finns from CSC. "But then," he remarked, "you only need to take a look at the weather outside."

Other highlights from Tuesday's plenary sessions included:

• A talk from Bill Michener, Professor and Director of e-Science Initiatives for University Libraries at the University of New Mexico, and Principal Investigator of the US DataONE project. He summarised the progress, challenges and lessons from four years of building global infrastructure for environmental science, and underlined Richard Frackowiak's point that progress now demands conversations and collaborations between the scientists and the technologists. He highlighted an early study from DataONE that found over 80% of scientists were keen to share their data but only 6% currently did; a telling corollary in the scientists' use of metadata indicated that, while some 450 did use standard metadata formats in describing their data, 266 used their own lab's "standard" and 676 used nothing at all.

• Ewan Birney, Associate Director of the EMBL-European Bioinformatics Institute, gave a very engaging talk on the rapid changes seen in genomic and proteomic science in the last decade. He noted, with a scary number of logarithmic graphs, the rise in data volumes driven by the crash in creation costs - between 2007 and 2010 the costs of sequencing genomes fell by three orders of magnitude to a few tens of cents a time. How much of this we can continue to store, let alone curate intelligently, is no longer a question for tomorrow but one for right now. 

Global approach

The Wednesday plenary session highlighted the urgency of the scientific data agenda through the rapid - meteoric, even - emergence of the Research Data Alliance over the last year. The RDA is a new forum for debate, discussion and above all doing in the dangerously fragmented landscape of data standards, practices and technologies. John Wood, RDA Council Chair and Secretary General of the Association of Commonwealth Universities, emphasised once more the emerging theme of the Conference: that the challenges of managing our research data go well beyond institutional or national boundaries, and only a global approach will do.

Of the four parallel tracks over the two days, this delegate found the sessions on policy and sustainability issues both fascinating and engaging. (Declaration of interest: I chaired the programme committee!) We heard some excellent contributions to the debate from speakers including Simon Hodson, new Executive Director of ICSU CODATA; Kevin Ashley and Sarah Jones of the Digital Curation Centre; and Peter Doorn, Director of the Dutch Data Archiving and Networked Services institute.

Data curation and preservation

A couple of highlights relevant to EPCC's work in the PERICLES project included an interesting discussion on the increasingly pertinent topic of data citation, and on what technological support research infrastructure providers - like EUDAT - or inventors - like PERICLES - should be building into their systems.

Another item for data infrastructures to consider, from a talk by Pawel Kamocki of the Institut für Deutsche Sprache, Mannheim, was the issue of licensing for data products and the complexity of combining data for research purposes when licences differ. The imminent approval of version 4.0 of the Creative Commons licence suite may offer a good route for those seeking to standardise their offerings for data re-use.

But perhaps the most intriguing notion of the Conference, and something that PERICLES might perhaps wish to consider, came from Ewan Birney: DNA as a storage medium. With proper allowance for encoding redundancy, one gram of DNA can hold hundreds of terabytes of information - fully-retrievable, and these days cheap to read - and with only modest storage constraints it's known to last 40,000 years and more. Now that's long-term preservation!

Further information

Pericles project blog

Pericles at EPCC


Rob Baxter, EPCC

This post first appeared on the Pericles project website.
