Proof-driven queries to preserve patient privacy

4 March 2019

In our role as members of the Research Engineering Group of the Alan Turing Institute, Anna Roubickova and I worked with Efi Tsamoura and Benjamin Spencer (Department of Computer Science at the University of Oxford) on PDQ, a proof-driven query planner that has great potential within the realm of data science for medical research. 

To answer a research question, a data scientist may want to run queries across data held within distributed data sources. Downloading all the data locally may be impractical, especially when dealing with large volumes of data. It can be more efficient to allow the data scientist to treat these distributed data sources as a single virtual data source, and to run queries over this virtual data source. This is where a query planner is invaluable. A query planner can take a query, parse and analyse it, and propose one or more query plans. Each plan describes how the query can be decomposed into sub-queries across the data sources and how the results of these sub-queries can be joined to answer the original query.

PDQ (proof-driven querying) is a Java query planner developed by the University of Oxford that can not only analyse queries and propose query plans but also evaluate these query plans against various constraints. These constraints can include minimising the volume of data that would be transferred from the data sources to the client before it is joined, or minimising the time taken to run the sub-queries over each data source. For medical research data, a critical constraint is privacy preservation.

A data scientist working in medical research may need access to different types of patient data (for example, age, gender, location, medical history, and previous and current treatments) which are held within distributed data sources. Before this data is passed to the data scientist, it needs to be filtered to remove any information that would allow an individual patient to be identified (by, for example, their surname). However, even if the data has been filtered, combining data from different data sources could inadvertently deanonymise a patient.

Making medical data accessible to data scientists while respecting both privacy and efficiency-related constraints is a key target application of PDQ. Views over distributed data sources can be designed so that any attempts to access identifiable patient data are blocked and the data scientist is prevented from running their query. PDQ can also use algorithms to assess whether a query plan will preserve the privacy of patients when data from a number of data sources are combined. This, for example, allows PDQ to discard a highly efficient plan that compromises privacy in favour of a less efficient plan that preserves privacy. In December 2018, Efi demonstrated PDQ to NHS Scotland to great interest.
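To make the trade-off between efficiency and privacy concrete, here is a minimal sketch of the selection step: filter out the plans that fail a privacy check, then take the cheapest of what remains. This is not PDQ's code. The `Plan` class, its `cost` and `preservesPrivacy` fields, and the `cheapestPrivatePlan` method are all hypothetical names invented for illustration, and the privacy verdict is assumed to have been computed elsewhere.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: none of these names come from PDQ itself.
class PlanSelectionSketch {

    /** A candidate query plan with an estimated cost and a privacy verdict. */
    static final class Plan {
        final String name;
        final double cost;               // estimated cost; lower is better
        final boolean preservesPrivacy;  // assumed to be decided by a separate check

        Plan(String name, double cost, boolean preservesPrivacy) {
            this.name = name;
            this.cost = cost;
            this.preservesPrivacy = preservesPrivacy;
        }
    }

    /** Returns the cheapest plan among those that preserve privacy, if there is one. */
    static Optional<Plan> cheapestPrivatePlan(List<Plan> candidates) {
        return candidates.stream()
                .filter(p -> p.preservesPrivacy)                      // discard plans that leak
                .min(Comparator.comparingDouble((Plan p) -> p.cost)); // cheapest survivor
    }
}
```

If no candidate preserves privacy, the result is empty, which corresponds to refusing to run the query at all.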

PDQ currently works with Microsoft SQL Server. It would be both unreasonable and infeasible to expect data providers to migrate their existing data to SQL Server just to use PDQ, so it is important that PDQ supports other popular database management systems. Our work looked at extending PDQ to work with the popular open source database management system PostgreSQL. We explored how to extract statistical information about relations from PostgreSQL, using PDQ's SQL Server-specific code as a starting point.

This statistical information is invaluable for query planners within database management systems themselves. To produce low-cost plans, a query planner needs to know at least an approximate distribution of the values in the data it runs a query over. However, it is not feasible to collect precise information first and only then do the query planning. Database management systems therefore maintain approximate, statistical information about the data they manage. PDQ has code to access these statistics from SQL Server for use in its own query planner. During our work we discovered that the statistics in SQL Server and in PostgreSQL differ quite a lot, not only at the syntactic level but also in their design.

To import PostgreSQL statistics into PDQ's query planner, we developed code to retrieve statistics from PostgreSQL's pg_stats system view. The pg_stats view provides statistics on the columns of tables within PostgreSQL, for example the most common values, the fraction of values that are null, and the estimated number of distinct values. Java classes were developed to retrieve and store this information for each column of a named table.
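As a rough illustration of what such retrieval can look like, the sketch below reads a few columns of PostgreSQL's `pg_stats` view over JDBC. It is not the code we wrote for PDQ: the class and method names (`PgStatsSketch`, `ColumnStats`, `fetch`) are hypothetical, and it assumes that a PostgreSQL JDBC driver is on the classpath and that the caller supplies an open `Connection`. The column names `attname`, `null_frac`, `n_distinct` and `most_common_vals` are those of `pg_stats` itself.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: class and method names are illustrative, not PDQ's.
class PgStatsSketch {

    /** Raw per-column statistics as reported by pg_stats. */
    static final class ColumnStats {
        final String column;            // attname
        final double nullFraction;      // null_frac
        final double nDistinct;         // n_distinct, stored as reported
        final String mostCommonValues;  // most_common_vals as text; may be null

        ColumnStats(String column, double nullFraction, double nDistinct,
                    String mostCommonValues) {
            this.column = column;
            this.nullFraction = nullFraction;
            this.nDistinct = nDistinct;
            this.mostCommonValues = mostCommonValues;
        }
    }

    // most_common_vals has type anyarray, so it is cast to text for transport.
    static final String QUERY =
            "SELECT attname, null_frac, n_distinct, "
          + "most_common_vals::text AS most_common_vals "
          + "FROM pg_stats WHERE schemaname = ? AND tablename = ?";

    /** Fetches the statistics for every column of one named table. */
    static List<ColumnStats> fetch(Connection conn, String schema, String table)
            throws SQLException {
        List<ColumnStats> result = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(QUERY)) {
            ps.setString(1, schema);
            ps.setString(2, table);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    result.add(new ColumnStats(
                            rs.getString("attname"),
                            rs.getDouble("null_frac"),
                            rs.getDouble("n_distinct"),
                            rs.getString("most_common_vals")));
                }
            }
        }
        return result;
    }
}
```

A table that has never been analysed has no rows in `pg_stats`, so an empty list here means "no statistics yet" rather than "no columns".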

In the future, additional methods will need to be implemented to derive, from these statistics, the information needed by PDQ's query planner. This could be done by extending PDQ's query planner to utilise PostgreSQL-like statistics or, preferably, developing a common representation of these statistics into which both PostgreSQL statistics and SQL Server statistics can be transformed.
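A common representation could be as small as an interface that answers the planner's questions, with one adapter per database system. The sketch below is only an illustration of that idea, using the estimated number of distinct values as the example. All names are hypothetical. The PostgreSQL adapter relies on the documented meaning of `n_distinct` (a negative value is minus the ratio of distinct values to rows). The second adapter assumes a source that reports a density equal to 1 divided by the number of distinct values, which is how we understand SQL Server's density vector; treat that as an assumption to be checked.

```java
// Hypothetical sketch of a common statistics representation; not PDQ code.
class CommonStatsSketch {

    /** What a planner might ask of any column, whatever the source system. */
    interface ColumnStatistics {
        double distinctValues(long rowCount);
    }

    /** Adapter for PostgreSQL-style n_distinct values. */
    static final class PostgresStyle implements ColumnStatistics {
        private final double nDistinct;

        PostgresStyle(double nDistinct) {
            this.nDistinct = nDistinct;
        }

        @Override
        public double distinctValues(long rowCount) {
            // Non-negative: an absolute estimate. Negative: minus the
            // fraction of rows that hold distinct values.
            return nDistinct >= 0 ? nDistinct : -nDistinct * rowCount;
        }
    }

    /** Adapter for a source reporting density = 1 / distinct values (assumed). */
    static final class DensityStyle implements ColumnStatistics {
        private final double density;

        DensityStyle(double density) {
            this.density = density;
        }

        @Override
        public double distinctValues(long rowCount) {
            return 1.0 / density;
        }
    }
}
```

With this shape, the planner depends only on `ColumnStatistics`, and supporting a further database system means writing one more adapter rather than changing the planner.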

A report on our work, "Extending PDQ to extract statistics from PostgreSQL", is available for download.

This work, which ran from May to November 2018, was supported by funding from Scottish Enterprise to the University of Edinburgh as part of the Alan Turing Institute-Scottish Enterprise Data Engineering Programme.