On Tuesday evening, the Index on Censorship held a debate asking whether transparency is bad for science. Two members of the Royal Society’s ‘Science as a Public Enterprise’ working group were on the panel: Sir Mark Walport FRS and Baroness Onora O’Neill FRS. Onora mentioned a useful report on transparency and privacy, written by Kieron O’Hara for the Cabinet Office’s transparency team. Martha Henriques from the Science Policy Centre takes us through the paper…
The prospect of opening up scientific data to a wider audience is likely to have a profound influence on the progress and economics of science. Medical and drug trials are areas where increased sharing and reuse of data could have an enormously beneficial impact, with the potential for research to be faster, more efficient, and less expensive. But these areas of research, among others, often involve personal or private data about the trial subjects. When data is personal, how should the privacy of the individual be balanced against the transparency of the research as a whole?
Kieron O’Hara investigates the challenges of striking the right balance between the privacy of the individual and the transparency of government. Useful parallels can be drawn with the creation of open science, which encourages the responsible release of scientific data. The case for transparency in science does not rest only on the moral assertion that transparency is generally a good thing; the focus is on how to better exploit the economic value of data already in existence. The government envisages significant opportunity for commercial and social entrepreneurship as a result of the large-scale release of government data in a reusable form; the same argument can be made for the scientific development made possible by the release of reusable scientific data. Direct economic benefit would come from the development of new technologies, and more indirect benefit from the development of the UK’s knowledge economy and STI sector as a whole.
A central point that O’Hara highlights in his report is that privacy and transparency are not mutually exclusive ideals; there are many instances in which the two are not even in opposition. A key example of one reinforcing the other is public confidence: transparency programmes can only retain it if the privacy of individuals is protected. While this is a valid and important point, there remain many instances where the two are in conflict.
A simple yet powerful method of resolving such conflicts is the anonymisation or pseudonymisation of datasets to avoid information being linked to a particular individual. Although these techniques are promising, they currently have technical limitations. Further research into the level of risk posed by deanonymisation is necessary, and may also help to indicate ways of improving current protection techniques. O’Hara therefore suggests that the imperfections of anonymisation must be carefully considered before individual datasets are released to the public.
Current protection techniques include the aggregation of data into less specific groups and the perturbation of data, which effectively adds noise to the dataset. The downside of these extra protections is that both detract from the potential value of the information contained in the dataset: the more the data is altered, the better the protection of the individual’s privacy, but the lower its economic value.
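To make the two techniques concrete, here is a minimal Python sketch. It is an illustration only, not a method taken from O’Hara’s report: the function names, the ten-year age bands, the Gaussian noise and the sample records are all assumptions chosen for the example.

```python
import random


def aggregate_age(age, width=10):
    """Aggregation: replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def perturb(value, scale, rng):
    """Perturbation: add zero-mean Gaussian noise with standard deviation `scale`."""
    return value + rng.gauss(0.0, scale)


# Invented trial records, for illustration only.
records = [
    {"age": 34, "weight_kg": 71.2},
    {"age": 40, "weight_kg": 88.5},
    {"age": 67, "weight_kg": 64.0},
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# The released version keeps an age band and a noisy weight, never the exact values.
released = [
    {
        "age_band": aggregate_age(r["age"]),
        "weight_kg": round(perturb(r["weight_kg"], 2.0, rng), 1),
    }
    for r in records
]

for row in released:
    print(row)
```

Widening the bands or raising the noise scale makes individuals harder to single out, but also makes the released figures less useful for analysis, which is the trade-off described above.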
Further issues arise because, even with these protective measures, deanonymisation has been shown in certain cases to be remarkably easy. O’Hara describes a case in the US where obtaining just three variables (date of birth, gender, and ZIP code) from an anonymised population register allowed over 50,000 voters to be identified with a 97% success rate. However, O’Hara maintains that “there is cause for optimism that sophisticated anonymisation, perturbation and pseudonymisation techniques will continue to allow the release of valuable data for use by the public, and the management of a negligible risk.”
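The reason three ordinary-looking variables can be so revealing is that their combination is often unique to one person. The Python sketch below measures this on a toy register. The records, field names and the `unique_fraction` helper are invented for illustration and are not drawn from the US case O’Hara describes.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose combination of the given fields is unique."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Invented register entries with names already removed.
register = [
    {"dob": "1970-01-01", "gender": "F", "zip": "12345"},
    {"dob": "1970-01-01", "gender": "F", "zip": "12346"},
    {"dob": "1982-05-17", "gender": "M", "zip": "12345"},
    {"dob": "1982-05-17", "gender": "M", "zip": "12345"},
]

for fields in (["gender"], ["dob", "gender"], ["dob", "gender", "zip"]):
    share = unique_fraction(register, fields)
    print(f"{fields}: {share:.0%} of records are unique")
```

Any record that is unique on these fields can be matched against a second source that lists the same fields alongside names, such as a voter roll, which is how removing names alone can fail to protect identity.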
These technical issues raise separate legal issues with the use of the term ‘anonymisation’. In common usage the term implies that the subject is guaranteed to be unidentifiable, which is proving to be an unrealistic expectation. There have been suggestions that the term be dropped from legal vocabulary because of these outdated connotations. As ‘anonymisation’ of data at best makes exact identification of an individual difficult rather than impossible, terms such as deidentifying, scrubbing, or disguising have been suggested in its stead.
The answer to these problems may lie in a reassessment of what individuals believe to be a reasonable risk. Current methods used to protect privacy can make deanonymisation extremely costly, time-consuming, and difficult, which will deter the vast majority of potential privacy breaches. Differential privacy is one approach that makes use of this principle. It employs statistical standards to assess whether an individual’s risk of identification would be significantly increased by their data appearing in a given dataset. Although ‘significant’ is a subjective term, this would at least allow individuals to make an informed decision about the level of risk they are content with.
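The report summary above does not name a particular mechanism, so the following Python sketch uses one common textbook construction, the Laplace mechanism, applied to a simple count. The parameter `epsilon` (the privacy budget), the helper names and the trial data are assumptions made for the example rather than anything specified by O’Hara.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution.

    The difference of two independent exponential draws with rate 1/scale
    follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records matching `predicate`.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Invented trial data: did each participant report a side effect?
participants = [{"side_effect": i % 4 == 0} for i in range(200)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
for epsilon in (0.1, 1.0):
    noisy = private_count(participants, lambda p: p["side_effect"], epsilon, rng)
    print(f"epsilon={epsilon}: noisy count = {noisy:.1f}")
```

A smaller `epsilon` adds more noise, so the published figure reveals less about whether any one person took part, at the cost of accuracy. Choosing `epsilon` is a concrete version of the subjective judgement about what counts as a ‘significant’ increase in risk.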
This may go some way towards striking an acceptable balance between the privacy of individuals and the transparency of research, although deeper investigation into protection techniques and their associated risks may be necessary before concrete decisions can responsibly be made.