Making Scotland's medical images available for research

20 December 2017

I've worked on many data analysis projects in my career and two themes keep recurring: obtaining the data can be a significant challenge, and once you have it you’ll find it is very messy.

Many years ago my first job after graduating was developing computer vision systems to detect breast cancer in mammograms. This was 1993 and most medical imaging was produced on film. After obtaining permission to access the data, the project installed a film scanner at the screening centre, employed an operator to scan the films and wrote software to remove identifiable information from the images. Radiologists were now able to annotate cancer features in the images, providing us with the required training and testing sets. Getting to this point took a lot of time and effort.

Skip a few years and the algorithms are working encouragingly on images from one centre but not so well on images produced elsewhere. The cause was a lack of metadata about how the imaging device was configured for each mammogram, which made it hard to calibrate our algorithms. Mentally I filed the project in a box to revisit when digital imaging was ubiquitous and moved on.

Twenty-five years later, digital imaging is ubiquitous and has been for several years. Scotland now has a large collection of digital medical images going back to 2010. This image collection is an excellent resource for a wide range of research with the potential to provide enormous public benefit. Add to this the ability to link these images to the patients’ medical history, including prescriptions, medical procedures and outcomes, and Scotland has a world-leading data set.

I am delighted to be working as part of a collaboration between EPCC, the Health Informatics Centre at The University of Dundee, and the electronic Data Research and Innovation Service (eDRIS) to enable safe and secure access to this data via the Farr Institute’s National Safe Haven. By making the data available in secure, anonymised and processable ways we will reduce the effort required to obtain the data and enable far greater usage of this essential resource.

Before the data is made available to researchers it must first be anonymised. Providing the software infrastructure required to support the anonymisation process is the main focus of this 10-month project, which is due to end in October 2018. Remember I said the data is always messy? Well, this is the point where that will hit us! We expect to see inconsistent use of metadata fields, identifiable data in surprising places, important data in unstructured text fields, incorrect data entry and inconsistent use of terms. Anonymisation rules will need to be robust and yet maintainable, and support the incremental release of data as each imaging device becomes understood and trusted.
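
To give a flavour of what "robust and yet maintainable" rules with incremental release might look like, here is a minimal sketch in Python. It is an illustration only, not the project's implementation: the field names, the device name, the `RULES` table, the `TRUSTED_DEVICES` set and the `anonymise` function are all assumptions made up for the example. Note also that a salted hash is strictly pseudonymisation rather than full anonymisation; it stands in here for whatever identifier replacement a real system would use.

```python
import hashlib

# Hypothetical rule table: metadata field name -> action.
# Keeping the rules as data, not code, makes them easier to review and maintain.
RULES = {
    "PatientName": "remove",
    "PatientID": "pseudonymise",
    "Modality": "keep",
    "StudyDate": "keep",
    "DeviceModel": "keep",
}

# Hypothetical allowlist of imaging devices whose metadata has been reviewed.
# Adding a device here releases its images; everything else is held back.
TRUSTED_DEVICES = {"ScannerModelA"}


def pseudonymise(value, salt):
    """Replace an identifier with a stable, salted SHA-256 digest (truncated)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def anonymise(record, salt):
    """Apply RULES to one metadata record (a plain dict).

    Returns None if the record's device is not yet trusted. Fields with no
    rule are dropped, so unexpected fields never leak by default.
    """
    if record.get("DeviceModel") not in TRUSTED_DEVICES:
        return None
    cleaned = {}
    for field, value in record.items():
        action = RULES.get(field, "remove")  # default-deny for unknown fields
        if action == "keep":
            cleaned[field] = value
        elif action == "pseudonymise":
            cleaned[field] = pseudonymise(str(value), salt)
        # "remove" (or any unrecognised action) drops the field
    return cleaned
```

The design choice worth noting is the default-deny behaviour: a free-text field that nobody anticipated is dropped rather than passed through, which matters when identifiable data turns up in surprising places.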

During my initial requirements-gathering interviews, researchers and industry made their enormous desire for such an offering very clear. The project will be a challenge, but it will also be hugely rewarding and worthwhile, and it is one I am very much looking forward to.