This poses a clear risk. To illustrate how, let’s look back at a well-known story about data privacy involving the Massachusetts Group Insurance Commission (GIC) in the 1990s.
A Lesson from History
Back in the mid-1990s, the Massachusetts Group Insurance Commission decided to release "anonymised" data on state employees, covering every single hospital visit, with the goal of helping researchers.
The GIC reassured the public that the data was safe because it had been "de-identified": explicit identifiers such as names and Social Security numbers were removed, while other demographic details, including ZIP codes, birth dates, and gender, were retained.
Then-Governor William Weld confidently proclaimed that the dataset posed no privacy risk because identifiers had been removed. In response, Latanya Sweeney, then a graduate student at the Massachusetts Institute of Technology, decided to challenge this claim.
By linking the supposedly anonymous dataset with publicly available voter registration data for the city of Cambridge, she was able to pinpoint Governor Weld's medical records.
Preventing Reconstruction Attacks
This technique, known as a "re-identification" or "reconstruction" attack, revealed the privacy risks lurking in even well-intentioned data-sharing practices. The incident demonstrated that supposedly anonymised data can often be re-identified, raising serious concerns for privacy and confidentiality.
Naturally, the conclusion was not to stop producing useful statistics altogether, since data genuinely aids development and research. But the incident posed a significant question: how can we make use of the world's most impactful data while guaranteeing that individual privacy is not compromised?
Following this public scandal, there was a significant shift in thinking about data privacy. The incident influenced practices and regulations in the field and is widely credited with motivating the development of differential privacy.
Understanding Differential Privacy
Differential privacy works by adding controlled "noise" to the aggregate results computed from a dataset, so that no individual's contribution can be singled out. Let's illustrate this with a simple example.
Imagine you have a dataset full of personal information and you want to find the average age of the group without revealing anyone's specific age. Under differential privacy, we inject random noise, for example by perturbing the computed average or the individual age values, before anything is released.
The noise is calibrated to hide the contribution of any single data point and make it infeasible to reverse-engineer individual records, while preserving the overall trends. The average age we calculate is therefore close to the real average, but it no longer reveals anyone's specific age.
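As a concrete illustration, here is a minimal sketch of one standard way to release such a noisy average, the Laplace mechanism. The age bounds, the epsilon value, and the function name dp_mean_age are illustrative assumptions, not details from the story above.

```python
import numpy as np

def dp_mean_age(ages, epsilon, lower=0.0, upper=100.0, rng=None):
    """Differentially private estimate of the mean age via the Laplace mechanism (sketch)."""
    rng = rng if rng is not None else np.random.default_rng()
    ages = np.clip(np.asarray(ages, dtype=float), lower, upper)  # bound each person's influence
    n = len(ages)
    # Sensitivity of the mean: changing one person's age moves it by at most (upper - lower) / n
    sensitivity = (upper - lower) / n
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return ages.mean() + noise

ages = [23, 35, 41, 29, 52, 60, 33, 47]
print("True mean:", np.mean(ages))
print("DP mean (epsilon = 1.0):", dp_mean_age(ages, epsilon=1.0))
```

Each run produces a slightly different answer close to the true mean; that randomness is precisely what protects any single person's age.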
Further, because the noise is random and its exact values are never published, an attacker cannot simply subtract it back out to recover the underlying records. It's like mixing sugar into coffee: once it's in, you can't get it out! You still obtain accurate, useful results, with a mathematical guarantee bounding how much anyone can learn about any individual in the dataset.
The Privacy-Utility Trade-Off
In differential privacy, there's a parameter called epsilon (ε). Think of it as a balancing scale: on one side sits privacy, on the other the usefulness of the data. The smaller the ε, the more noise you add and the further the scale tips towards privacy, and vice versa.
This trade-off between privacy and utility becomes far less painful in larger populations: the noise needed to hide any one person is a tiny fraction of the aggregate, so individuals are effectively anonymised within a vast dataset at little cost to accuracy, as the sketch below illustrates.
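Here is a rough, self-contained sketch of both effects using the same Laplace-mechanism idea as above; the population sizes, age range, and epsilon values are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (100, 10_000, 1_000_000):
    ages = rng.integers(18, 90, size=n)               # synthetic population
    true_mean = ages.mean()
    for epsilon in (0.1, 1.0):
        # Laplace noise scale for a bounded mean: (upper - lower) / (n * epsilon)
        scale = (100 - 0) / (n * epsilon)
        noisy_mean = true_mean + rng.laplace(0.0, scale)
        print(f"n={n:>9,}  eps={epsilon:<4}  true mean={true_mean:6.2f}  "
              f"noisy mean={noisy_mean:6.2f}  noise scale={scale:.5f}")
```

With a million people, even a strict ε of 0.1 barely perturbs the mean, whereas with only a hundred people the same ε makes the estimate noticeably noisier.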
As the UK ICO recently noted, differential privacy may be one of the few techniques capable of providing true data anonymisation (rather than mere pseudonymisation). Yet only a limited number of data science and machine learning practitioners are proficient in applying it.