DataLoch: Unifying health data in the Lothian region
20 November 2023
EPCC's Ally Hume gives a personal view of working on the DataLoch data service, developed by the University of Edinburgh and NHS Lothian to address major health and social care challenges.
Our health data is not as linked up as one might, naïvely, expect. That was my first lesson when we started work on the DataLoch service in the summer of 2019. Primary care data (eg your interactions with your GP) are held in one system and secondary care data (eg interactions with hospitals) in another system. Bringing these data together into a unified view of the health of the Lothian population to support research was to be the first goal of the DataLoch service.
Many research projects were using only secondary care data. This meant that although they knew how many patients were attending hospital with a certain condition, they did not know how many people with that condition were in the general population and not attending hospital for whatever reason. In addition, some useful data such as smoking history, alcohol use, and frailty are mostly captured through primary care, so researchers were often unable to evaluate the impact of these factors on the outcomes being studied.
Linking primary and secondary care data
Through the hard work of data acquisition, information governance, and technical teams, DataLoch now hosts routinely collected primary care data from 85% of GP practices in the NHS Lothian region, and these data are linked to data from several secondary care sources. This data collection captures individuals’ day-to-day interactions with health care such as visits to GPs and hospitals, test results, diagnoses, medicines, operations etc.
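To make the benefit of linkage concrete, here is a minimal sketch of the kind of question a linked view can answer. It is an illustration only, not DataLoch code: the patient keys, the `gp_asthma` and `hospital_asthma` sets, and the `linked_view` function are all hypothetical.

```python
# Hypothetical de-identified patient keys; invented for illustration only.
gp_asthma = {"p01", "p02", "p03", "p04"}    # condition recorded in primary care
hospital_asthma = {"p02", "p04", "p05"}     # condition recorded in secondary care


def linked_view(primary, secondary):
    """Split patients by where their condition has been recorded."""
    return {
        "in_both": primary & secondary,
        # Invisible to a study that uses secondary care data alone.
        "primary_only": primary - secondary,
        "secondary_only": secondary - primary,
    }


view = linked_view(gp_asthma, hospital_asthma)
print(sorted(view["primary_only"]))
```

In this toy example, the patients in `primary_only` are the ones a hospital-only study would never see.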
The DataLoch service provides trusted researchers access to de-identified data extracts within a secure digital environment. At DataLoch, we are very aware of the great responsibility that comes with managing these data, and we ensure researchers obtain only the data required to carry out their specific research and that the de-identification process protects individuals’ identities. Additionally, when applying to access the data, researchers must justify the public benefit and request only the data needed to perform their study.
Diving into DataLoch
That’s the high-level overview of DataLoch, but my work has been with the lower-level technical details of DataLoch’s data feeds. So let us pitch camp in the beautiful Scottish countryside and dive deeper into the Loch. Once under the water, it’s time to confess to those with a data systems background that DataLoch is much closer to a data warehouse than a data lake. It is essentially a relational database with highly structured data and a schema-on-write usage model rather than a data lake with unstructured or semi-structured file data with a schema-on-read model. So, let’s just say it's called DataLoch because that’s a far cuter name than DataWarehoose!
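For readers less familiar with the distinction, the sketch below contrasts the two models using only Python's standard library. It is a toy illustration under stated assumptions, not DataLoch's actual schema or tooling: the `admission` table, its columns, and the helper functions are all invented for the example.

```python
import json
import sqlite3

# Schema-on-write: the structure is declared up front, and rows that
# do not fit it are rejected at load time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE admission ("
    " patient_id INTEGER NOT NULL,"
    " admitted_on TEXT NOT NULL)"
)


def try_insert(row):
    """Return True if the row satisfies the declared schema."""
    try:
        conn.execute("INSERT INTO admission VALUES (?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


good_write = try_insert((1, "2023-01-05"))
bad_write = try_insert((2, None))  # missing date is rejected on write

# Schema-on-read: raw records are stored as they arrive, and structure
# is imposed only when someone reads them.
raw_lines = [
    '{"patient_id": 1, "admitted_on": "2023-01-05"}',
    '{"patient_id": 2}',  # stored without complaint; the gap surfaces on read
]


def read_admissions(lines):
    """Apply a schema at read time, filling missing fields with None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append((record["patient_id"], record.get("admitted_on")))
    return rows


rows_read = read_admissions(raw_lines)
```

In the first half the incomplete row never reaches the table. In the second half it is stored anyway, and every reader has to decide what to do about the missing date.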
While we’re being candid and honest, I might as well admit I was not completely honest with you in the opening paragraph. I said primary care data are held in one system and secondary care data held in another. In truth, primary care practices in Lothian use one of two different systems and secondary care data are held in myriad systems. While most secondary care data (such as A&E attendance, admissions, discharges, ward movements, and lab tests) come from one system, many other datasets come from specialised systems. For example, respiratory test results, detailed critical care data, details of cardiology surgery and procedures, and inpatient prescribing data all come from distinct systems. One of the challenges of DataLoch from a technical point of view is interfacing with these systems to build the data feeds, and then maintaining and monitoring those feeds.
Health data limitations
If lesson one was that health data are not as well linked as one might expect, lesson two was that health data can be messy. This was not really a new lesson for me, as all data I have worked with is messy, but one never stops discovering new ways for the data to be messy. One discovery is that when a patient returns to re-register at a GP practice at which they were previously registered, the start and end dates of the original registration are overwritten. This pattern occurs surprisingly often, with approximately 0.5% of previous registration dates being overwritten each year. Go back 20 years and details of around 10% of the registrations are missing. While this has no impact on front-line medical care, it can have a significant impact on research where it is important to know that all members of a cohort were observable throughout the period being studied. Being registered at a GP practice during the study period is a good proxy for observability. Now that we know this limitation, we document it, advise researchers, and look at alternative sources of data to fill in the gaps.
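The arithmetic and the observability check above can be sketched in a few lines. This is a rough illustration, not DataLoch's method: the text does not say whether the 0.5% figure compounds, so both readings are shown, and the function names and the convention that an end date of `None` means "still registered" are assumptions made for the example.

```python
from datetime import date

ANNUAL_OVERWRITE_RATE = 0.005  # from the text: about 0.5% per year


def fraction_overwritten(years, rate=ANNUAL_OVERWRITE_RATE, compound=False):
    """Estimate the share of registration records lost after `years`."""
    if compound:
        # Each year a record survives with probability (1 - rate).
        return 1 - (1 - rate) ** years
    # Simple sum, matching the "around 10%" figure for 20 years.
    return rate * years


def observable_throughout(reg_start, reg_end, study_start, study_end):
    """True if the registration spans the whole study period.

    Assumption: reg_end of None means the patient is still registered,
    and reg_start of None means the start date is unknown.
    """
    if reg_start is None:
        return False
    still_registered = reg_end is None or reg_end >= study_end
    return reg_start <= study_start and still_registered


simple_estimate = fraction_overwritten(20)
compound_estimate = fraction_overwritten(20, compound=True)
in_cohort = observable_throughout(
    date(2000, 1, 1), None, date(2010, 1, 1), date(2015, 1, 1)
)
```

Over 20 years the simple sum gives 10% and the compounded estimate comes out slightly lower, at roughly 9.5%, so "around 10%" holds under either reading.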
A lot of health care data are coded using a variety of coding systems. ICD-10 is used for conditions in secondary care, OPCS-4 for operations, and Read codes for all sorts of primary care data such as conditions, medicines, test results, administrations etc. These coding systems have their challenges but are very helpful and underpin a lot of data analysis and research.
Turning free text into information
Despite these coding systems, a large amount of valuable information is still captured in free-text notes. DataLoch is researching Natural Language Processing (NLP) technologies, firstly to remove identifiable information from free text and secondly to extract useful information from the text. Any useful information extracted from the free text can then be made available to researchers in structured formats that do not require them to see or process the free text. The DataLoch team will not be able to carry out all the required NLP tasks itself, so once identifiable information has been removed, the de-identified free text can be made available to other NLP researchers via the DataLoch Secure Data Environment. This will allow many researchers to explore ways to extract useful information from these notes.
In addition to NLP, two other big challenges in the coming years for DataLoch will be to add geospatial and social care data. Geospatial data will allow researchers to link to other datasets and hence study the impacts of society on health. One such example is studying the impact of fuel poverty on childhood respiratory conditions. It is well known that health and social care can be better integrated than is currently the case. Including social care data within DataLoch will enable researchers to better study the myriad relationships between health and social care.
DataLoch has been, and remains, an exciting service to work on. As well as the technical team we have data analysts, data scientists, clinicians, experts in information governance, engagement and data acquisition, as well as programme, service and data managers, and administrators. This strong, diverse team provides a supportive and collaborative environment with an ethos of providing data for research while respecting the people whose data this is. This ethos is further supported by our Public Reference Group, who assess the public value of applications and meet frequently to assess and advise DataLoch from a public perspective.
DataLoch website: https://dataloch.org/