Securely citing datasets
Posted: 22 Aug 2013 | 14:50
This post was written by Adrian Mouat, a former EPCC employee who is now an independent software consultant.
Citing a paper is a reasonably straightforward and well-defined task: give a reference to the author and the publication the paper appeared in, and you're pretty much there. Anyone who wants to look up the reference just has to find the publication, and they will see exactly the same text you saw.
Unfortunately, citing datasets is not as simple, at least not if you want the security of knowing that readers who follow the citation will find exactly the same data you used.
Typical advice [1] is to use a citation such as:
Irino, T; Tada, R (2009): Chemical and mineral compositions of sediments from ODP Site 127-797. V. 2.1. Geological Institute, University of Tokyo. http://dx.doi.org/10.1594/PANGAEA.726855
This seems pretty unambiguous and provides us with a nice DOI that uniquely identifies the resource [2]. However, I don't believe it goes far enough to protect authors who use and/or cite datasets belonging to others.
The issue is that the citation does not provide any means to verify that the data pointed to is the same data as seen by the author of the citation. Without such checks in place, the following scenarios are possible:
- The file pointed to by a DOI becomes corrupted. Several people publish papers based on the corrupted data. The owner of the data attempts to recreate the dataset, but can no longer remember which version of the data the DOI refers to.
- The creator of the dataset intentionally modifies the data in order to fix a typo or other simple mistake. Because of this, no one can identically reproduce the results shown in papers citing the dataset. (Whilst it is bad practice to update a dataset without updating the version number and issuing a new DOI, there is nothing to prevent this from happening.)
- The author of a paper citing their own dataset receives criticism that the data does not support their conclusions. The author simply modifies the data to support their stance, leaving the critic with no proof.
- A hacker gains control over a network link and modifies the data in transit, resulting in researchers using dangerously misleading data.
The solution is to include a checksum of the dataset in the citation. A checksum is a (comparatively) short string calculated from the data using a given algorithm. Identical files will always produce the same checksum, and with a strong enough method (a cryptographic hash function) it is computationally infeasible to find two different files that produce the same checksum. In other words, if the file has been modified, the checksum will change. For more details, refer to the Wikipedia article on cryptographic hash functions.
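To see this sensitivity in action, here is a minimal sketch (the file names and contents are throwaway examples, not real datasets): changing a single character in a file yields a completely different SHA1 checksum.

```shell
# Create two files that differ by a single character.
printf 'Measurement: 42.0\n' > original.txt
printf 'Measurement: 42.1\n' > modified.txt

# Compute a SHA1 checksum for each.
# openssl prints "SHA1(file)= <40 hex chars>"; awk keeps just the hex digest.
sum_orig=$(openssl sha1 original.txt | awk '{print $NF}')
sum_mod=$(openssl sha1 modified.txt | awk '{print $NF}')

echo "original: $sum_orig"
echo "modified: $sum_mod"

# The two digests bear no resemblance to each other.
[ "$sum_orig" != "$sum_mod" ] && echo "checksums differ"
```

Note that the checksums do not change "a little" for a small edit; a good hash function scrambles the output entirely, so even subtle tampering is obvious.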
This means that our citation above should be expanded to:
Irino, T; Tada, R (2009): Chemical and mineral compositions of sediments from ODP Site 127-797. V. 2.1. Geological Institute, University of Tokyo. http://dx.doi.org/10.1594/PANGAEA.726855. SHA1(Irino-Tada_2000.zip)= d02bc5fa873d033674b58c500b7ba42469b39dc4
The length and randomness of the string is unfortunate, but necessary for security. Some people may suggest that the checksum should be part of the DOI, but this only works if we can trust that the checksum in the DOI hasn't been changed [3] (the DOI implementations I've seen allow metadata, including checksums and timestamps, to be updated).
It's possible to see checksums in action right now by looking at popular (legal!) download sites. For example, the download page for the Apache web server lists PGP, MD5 and SHA1 checksums (or "signatures", in their parlance) for each of its downloads.
To validate or create a checksum you'll need some software. If you are using Linux or Mac OS, the easiest solution is to use openssl from the terminal. For example, to compute the SHA1 checksum of the Apache web server download:
$ openssl sha1 httpd-2.4.6.tar.bz2
The command prints a line of the form SHA1(httpd-2.4.6.tar.bz2)= <checksum>. If this string matches the value on the website, I can be confident that my download hasn't been corrupted or tampered with. Exactly the same process can be used to generate the checksum for my own files, which I can then include in any citations.
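The whole round trip can be scripted. The sketch below fabricates a tiny dataset so it is self-contained; in practice the "expected" value would be the checksum copied from the citation, and the file would be your actual download.

```shell
# Author side: create a dataset and record its checksum for the citation.
# (The CSV content here is invented purely for illustration.)
printf 'site,depth_m,SiO2_pct\n797,10.5,62.1\n' > dataset.csv
expected=$(openssl sha1 dataset.csv | awk '{print $NF}')

# Reader side: obtain the file (the download is simulated with cp)
# and recompute its checksum independently.
cp dataset.csv downloaded.csv
actual=$(openssl sha1 downloaded.csv | awk '{print $NF}')

# Compare the two digests before trusting the data.
if [ "$actual" = "$expected" ]; then
    echo "OK: file matches the cited checksum"
else
    echo "WARNING: checksum mismatch" >&2
fi
```

The key point is that the reader recomputes the checksum locally and compares it against the value printed in the citation, rather than trusting anything stored alongside the download.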
If you are on Windows or would rather not use the terminal, other solutions exist, such as HashTab for Windows and Mac OS or the MD5 Reborned Hasher which runs in the Firefox web browser. (Please note that I haven't used any of these solutions and can't vouch for their effectiveness).
Please consider adding such a checksum to your dataset citations and DOI records. It protects both you and your readers.
[1] Example taken from DataCite Metadata Schema for the Publication and Citation of Research Data. Version 3.0, July 2013. doi:10.5438/0008 SHA1(DataCite-MetadataKernel_v3.0.pdf)= 79e1dd73f68118290e0bcad34999151284383fdf (I'm bewildered as to why this document doesn't even mention the issues addressed in this article.)
[2] DOIs, or Digital Object Identifiers, provide persistent and unambiguous links to entities. Note that the metadata associated with DOIs may be changed at any time and the linked entity is typically a URL.
[3] DOIs do support metadata, which may include checksums and timestamps. However, their inclusion is not mandatory and the metadata may be changed at any point. Enforcing the addition of such metadata (and making it immutable) would be a large improvement, but would still require all parties to trust the DOI infrastructure. By citing the checksum directly, you only need to trust the security of the checksum algorithm. Note that the metadata for the DOI cited in the article does include timestamps, but not a checksum.