Wait a minute, where *are* my data?

Author: Rob Baxter
Posted: 7 Aug 2013 | 11:56

Policy restrictions on data storage can make the straightforward technological problems complex, over-constrained and potentially insoluble.

Pic credit: Jeff Rowley Big Wave Surfer

As the slowly toppling wave of research data begins to overwhelm us all, we're increasingly looking for new ways to automate the management of all these bits. Keeping human curators and data managers in the loop becomes ever more unscalable and unsustainable. So, we're storing data in the Cloud, auto-replicating them five ways so we don't lose any, letting the systems manage the data for us.

The technological barriers to increasingly autonomous data management can, by and large, be overcome, given the will, the resources and a decent engineering approach. Moving bits around is a solved problem (see Stephen Booth's recent post for proof). The big challenges surrounding research data management are becoming less technical, and much more policy-driven.

The fundamental nature of any sort of offsite, cloud-based storage is that you neither know nor care any longer where your data are stored - they are available wherever you are. But what if you do care? Policy restrictions on data can suddenly make the straightforward technological problems terribly complex, over-constrained and potentially insoluble. Yet for any of today's distributed research infrastructures, these policy questions are inescapable.

National and international laws create one immediate set of concerns. What if the data in question cannot, for reasons of copyright, leave the data centre at which they are curated? Automatic off-site replication to a cloud may actually be unlawful. Certain categories of data may be made freely available for research purposes within a country, but national laws may forbid their export beyond its borders – as is the case with medical research data in Germany, for example. How can one ensure that the off-site storage cloud is still within the “nation of curation” – and that the data don’t take a network route that crosses a border on their way there? What if the transmission of data to a remote site is allowed, but requires additional formalities to be completed first? The copying of digital art to an off-site cloud may very well be considered an inter-museum loan, with consequences for insurance, legal guardianship and, once again, copyright.
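To make these constraints a little more concrete, here is a minimal sketch of the kind of check an automated replication service might run before copying anything off-site. It is an illustration only, not how any particular project does it: the names `Policy`, `Target` and `may_replicate` are hypothetical, and it simply assumes that the country of the target site and the countries crossed by the network route are already known, which in practice is a hard problem in its own right.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy record attached to one dataset."""
    allowed_countries: frozenset        # country codes where copies may reside or transit
    offsite_allowed: bool = True        # False if data may not leave the curating centre
    formalities_required: bool = False  # e.g. a loan agreement must be completed first


@dataclass(frozen=True)
class Target:
    """Hypothetical description of a replication target."""
    site: str
    country: str
    route_countries: tuple  # countries the network path is ASSUMED to cross


def may_replicate(policy, target, home_site, formalities_done=False):
    """Return (allowed, reason) for copying a dataset from home_site to target."""
    if target.site != home_site and not policy.offsite_allowed:
        return False, "data may not leave the curating data centre"
    if target.country not in policy.allowed_countries:
        return False, f"target site is in {target.country}, outside the allowed countries"
    # Every country on the (assumed known) route must also be allowed.
    crossed = [c for c in target.route_countries if c not in policy.allowed_countries]
    if crossed:
        return False, "network route crosses " + ", ".join(crossed)
    if policy.formalities_required and not formalities_done:
        return False, "additional formalities must be completed first"
    return True, "replication permitted"


if __name__ == "__main__":
    medical = Policy(allowed_countries=frozenset({"DE"}))
    cloud = Target(site="cloud-b", country="DE", route_countries=("DE", "NL"))
    print(may_replicate(medical, cloud, home_site="home-dc"))
```

Even in this toy form, the technical part is a handful of comparisons. The difficulty lies in knowing, reliably, the facts the check depends on, and in deciding who is entitled to write the policy in the first place.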

Automating research data management while retaining effective policy control is a hot topic, and a number of projects (PERICLES is one, EUDAT another) are exploring new ways to automate not only the data curation process, but the way policy decisions may change the very process itself - all to maintain a clear understanding of what's happening to your data in an increasingly automated world. Because sometimes, you really do care where your data are.

Contact

Rob Baxter, EPCC