Australia is charging headlong into a privacy disaster as government open data initiatives come online without considering how to properly implement privacy safeguards and data anonymity.
Learning about data security and privacy the hard way is a bad idea. Unfortunately, Australians are being schooled on a national scale, especially the 10 percent of Australians who have been involuntarily included in the Australian government’s release of “anonymous” Medicare data.
It’s true that open data is a great and noble goal. It can provide access to data that enables important research, be used to shape public policy, direct services to those in need, and help improve the lives and health of our entire nation. There’s no doubt we need high-quality research.
But what price are we willing to pay for these benefits? Can we trust the capability of the Australian government departments to anonymise and secure private records before public release?
Data re-identification problems
A new paper from computer researchers at the University of Melbourne has demonstrated how small pieces of data can easily be reassembled in “linking attacks” to quickly paint a big picture of an individual. Having already created a furore by proving that doctor and supplier IDs can be reverse engineered, the researchers turned to examining the anonymity of patient data.
The researchers were able to identify medical procedures performed on individuals through analysis of loose geographic locations, and approximate procedure dates and ages. A key to the data re-identification process is that there’s simply not enough randomness in public Medicare datasets to truly protect the identity of individuals.
Data anonymity problems also increase greatly when we start to publish datasets with overlapping records, and they are compounded further when linking keys are deliberately created to enable the tracking of “anonymous individuals” across multiple datasets.
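To make the idea of a linking attack concrete, here is a minimal sketch in Python. It is not the Melbourne researchers’ method, and every record, field name, and item code in it is made up for illustration. It only shows how a few known facts (birth year, region, month of a procedure) can single out one pseudonymous ID, and how a linking key then exposes everything else attached to that ID.

```python
# Toy "anonymised" release: names removed, quasi-identifiers kept.
# All values are hypothetical; "pid" stands in for a linking key.
released = [
    {"pid": "a91", "birth_year": 1984, "region": "QLD", "month": "2016-07", "item": "X1"},
    {"pid": "b22", "birth_year": 1984, "region": "NSW", "month": "2016-07", "item": "X2"},
    {"pid": "c73", "birth_year": 1990, "region": "QLD", "month": "2016-03", "item": "X3"},
    {"pid": "a91", "birth_year": 1984, "region": "QLD", "month": "2016-09", "item": "X4"},
]

# What an attacker already knows about one person (hypothetical).
known = {"birth_year": 1984, "region": "QLD", "month": "2016-07"}


def link(records, facts, keys=("birth_year", "region", "month")):
    """Return the pseudonymous IDs whose records match every known fact."""
    return {r["pid"] for r in records if all(r[k] == facts[k] for k in keys)}


def history(records, pid):
    """Return every item recorded against one pseudonymous ID."""
    return [r["item"] for r in records if r["pid"] == pid]


candidates = link(released, known)
if len(candidates) == 1:
    # A single match means the person is re-identified, and the linking
    # key now reveals the rest of their records as well.
    pid = next(iter(candidates))
    print(pid, history(released, pid))
```

The fewer people who share a combination of attributes, the more likely the candidate set shrinks to one.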
The greater good
What if your private and anonymised data can help to make Australia a better place? That’s a reasonable argument, and it was put forward in a report by the Australian Productivity Commission, which stated: “Increased access to data can facilitate the development of ground-breaking new products and services that fundamentally transform everyday life.”
Unfortunately, while attempting to scale the lofty heights of open data and public goods, the Australian government has again failed to consider privacy implications, something that might be a concern for the 10 percent of Australians who were randomly selected to have their personal data “anonymised” and publicly released. This is no surprise; anonymising data is really, stupendously difficult.
Data re-identification issues range from mildly embarrassing to serious and potentially life altering — a quick look at the Australian Medicare MBS data shows how unique some data can be. Ask yourself, what are the consequences of re-identifying the one girl in Queensland aged 5 to 15 who received “Pregnancy Support Counselling Services” — Medicare item number 4001 — in July 2016? What is embarrassing for some could be catastrophic for others.
The Australian government has been grappling with mathematics recently.
With regard to privacy, we’ve resorted to wielding a stick and yelling, “don’t do that!”, as our government attempted last year to introduce legislation to criminalise data re-identification.
Even once the legislation is passed — it’s currently in limbo — jealous ex-lovers, unscrupulous insurance providers, nosey employers, extortionists, and foreign governments won’t care that re-identification is prohibited by law. Instead, we need to consider data re-identification from a cryptographic or mathematical point of view. Ideally we’d have a secure way of making it statistically very unlikely that re-identification can take place.
Somehow we need to find a balance between having usable data that helps researchers, and maintaining the privacy of individuals and protecting the public. It’s not an easy problem to solve and it’s going to require more than a legislative approach and a weak redaction of people’s names.
Aggregate data out, open data in!
The strongest way of anonymising data is to aggregate records before releasing a summary. This way, the details of many people are merged into a single statistic that describes something like the “number of treatments for heart disease”. But aggregating data needs specialist data scientist skills. You can’t just have the work experience kid group some data together and call it a dataset. The aggregation process needs to ensure the data is correctly interpreted, that no “skew” is introduced, and that it’s still useful.
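As a rough illustration of aggregation, the sketch below counts records per group and withholds any count below a threshold, so that a lone individual does not appear as a cell of one. The records, field names, and the threshold of 5 are assumptions chosen for the example, not any agency’s actual rule.

```python
from collections import Counter


def aggregate(records, group_keys, min_cell=5):
    """Count records per group; suppress (None) any count below min_cell."""
    counts = Counter(tuple(r[k] for k in group_keys) for r in records)
    return {group: (n if n >= min_cell else None) for group, n in counts.items()}


# Hypothetical records: twelve people in one age band, one person in another.
records = (
    [{"region": "QLD", "age_band": "45-54", "treatment": "heart disease"}] * 12
    + [{"region": "QLD", "age_band": "5-14", "treatment": "heart disease"}] * 1
)

summary = aggregate(records, ("region", "age_band"))
print(summary)
```

Suppression on its own is not a complete answer: if totals are published alongside the table, a withheld cell can sometimes be recovered by subtraction, which is part of why this work needs specialist care.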
Releasing aggregate data and descriptive statistics is also the old business model — it’s so last millennium, we’ve been there and done that. We called it the Australian Bureau of Statistics and for a century they’ve provided complex data analysis and reporting. Furthermore, given that the ABS is on a job-cutting spree, the latest trend is toward more data self service and open data access.
A big problem and no easy solutions
Can we secure our private data with cryptographic certainty? No.
Unfortunately, that’s not how cryptography works. Quite simply, the goal of cryptography is to make data indistinguishable from random garbage, and random garbage is useless to demographers and public policy researchers. But we can decide how difficult we want to make “cracking” the data by compromising some of its statistical properties.
Making data anonymous is incredibly difficult. The key is to provide subsets of data while adding enough random noise or removing obvious outliers. This needs to be done carefully to make sure there’s enough randomness to make guessing identities difficult, while still preserving “the vibe of the thing”.
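One well-known way of adding random noise, not named in this article, is the Laplace mechanism from differential privacy. The sketch below applies it to a simple count. The function name, the `epsilon` values, and the example count are assumptions for illustration. A smaller `epsilon` means more noise and more privacy, at the cost of accuracy.

```python
import random


def noisy_count(true_count, epsilon, rng=random):
    """Return true_count plus Laplace noise with scale 1/epsilon.

    Adding or removing one person changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1/epsilon.
    """
    scale = 1.0 / epsilon
    # The difference of two independent exponential samples with the
    # same rate is Laplace-distributed with the matching scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


# Hypothetical example: publish a noisy version of a count of 100.
print(noisy_count(100, epsilon=0.5))
```

Averaged over many releases the noise cancels out, so statistics stay useful in bulk, while any single published figure no longer reveals whether one particular person is in the data.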
While we can’t completely solve the problem, we should at least understand our options and be well informed of the risks. We can certainly do better than naively obscuring names and grouping birth dates.
Charging headlong into a privacy disaster
The government is charging ahead with open data access and disturbing discussion papers. Data releases are everywhere, from crime statistics to proposals to use people’s mobile phone signals to analyse vehicular traffic. We’re hurtling headlong into a privacy disaster and no one seems to be at the wheel.
We’ve seen departments like the ABS slash staff numbers, while making grabs for additional sensitive data. Meanwhile we’re told “relax, we’ve got this”, when it’s quite clear that’s not the case.
Australia’s thirst for data and our government’s track record for jumping the gun mean that we’ve approached open data initiatives without truly understanding the benefits or risks. We’ve implemented nothing to protect individuals, such as anonymisation standards. This seems like something we probably should get around to sooner rather than later, before undoing the privacy mess becomes impossible.