This poses a clear risk. To illustrate how, let’s look back at a well-known story about data privacy involving the Massachusetts Group Insurance Commission (GIC) in the 1990s.
A Lesson from History
Back in the mid-1990s, the Massachusetts Group Insurance Commission decided to release "anonymised" data on state employees, covering every single hospital visit, with the goal of helping researchers.
The GIC reassured the public that the data was safe because it had been "de-identified": explicit identifiers such as names and Social Security numbers were removed, while other demographic details, including ZIP codes, birth dates, and gender, were retained.
Then-Governor William Weld confidently proclaimed that the dataset posed no privacy risk because identifiers had been removed. In response, Latanya Sweeney, then a graduate student at the Massachusetts Institute of Technology, decided to challenge this claim.
By linking the supposedly anonymous dataset with publicly available voter registration data for the city of Cambridge, she was able to pinpoint Governor Weld's medical records.
Preventing Reconstruction Attacks
This technique, known as a "re-identification" or "reconstruction" attack, revealed the privacy risks lurking in even well-intentioned data-sharing practices. The incident demonstrated that supposedly anonymised data can often be re-identified, raising serious concerns for privacy and confidentiality.
Naturally, the conclusion was not to stop producing useful statistics altogether, since data genuinely aids development and research. But the incident posed a significant question: how can we make use of the world's most impactful data while guaranteeing that individual privacy is not compromised?
Following this public scandal, there was a significant shift in thinking about data privacy. The incident influenced practices and regulations in the field and is widely credited with motivating the development of differential privacy.
Understanding Differential Privacy
Differential privacy works by adding controlled "noise" to the aggregate results computed from a dataset, so that no individual's contribution can be singled out. Let's illustrate this with a simple example.
Imagine you have a dataset full of personal information and you want to find the average age of the group without revealing anyone's specific age. Under differential privacy, we inject random noise, for example by perturbing the computed average or the individual age values, before anything is released.
The noise is calibrated to hide the contribution of any single data point and make it infeasible to reverse-engineer individual records, while preserving the overall trends. The average age we calculate is therefore close to the real average, but it no longer reveals anyone's specific age.
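As a concrete illustration, here is a minimal sketch of one standard way to release such a noisy average, the Laplace mechanism. The age bounds, the epsilon value, and the function name dp_mean_age are illustrative assumptions, not details from the story above.

```python
import numpy as np

def dp_mean_age(ages, epsilon, lower=0.0, upper=100.0, rng=None):
    """Differentially private estimate of the mean age via the Laplace mechanism (sketch)."""
    rng = rng if rng is not None else np.random.default_rng()
    ages = np.clip(np.asarray(ages, dtype=float), lower, upper)  # bound each person's influence
    n = len(ages)
    # Sensitivity of the mean: changing one person's age moves it by at most (upper - lower) / n
    sensitivity = (upper - lower) / n
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return ages.mean() + noise

ages = [23, 35, 41, 29, 52, 60, 33, 47]
print("True mean:", np.mean(ages))
print("DP mean (epsilon = 1.0):", dp_mean_age(ages, epsilon=1.0))
```

Each run produces a slightly different answer close to the true mean; that randomness is precisely what protects any single person's age.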
Further, because the noise is random and its exact values are never published, an attacker cannot simply subtract it back out to recover the underlying records. It's like mixing sugar into coffee: once it's in, you can't get it out! You still obtain accurate, useful results, with a mathematical guarantee bounding how much anyone can learn about any individual in the dataset.
The Privacy-Utility Trade-Off
In differential privacy, there's a parameter called epsilon (ε). Think of it as a balancing scale: on one side sits privacy, on the other the usefulness of the data. The smaller the ε, the more noise you add and the further the scale tips towards privacy, and vice versa.
This trade-off between privacy and utility becomes far less painful in larger populations: the noise needed to hide any one person is a tiny fraction of the aggregate, so individuals are effectively anonymised within a vast dataset at little cost to accuracy, as the sketch below illustrates.
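Here is a rough, self-contained sketch of both effects using the same Laplace-mechanism idea as above; the population sizes, age range, and epsilon values are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (100, 10_000, 1_000_000):
    ages = rng.integers(18, 90, size=n)               # synthetic population
    true_mean = ages.mean()
    for epsilon in (0.1, 1.0):
        # Laplace noise scale for a bounded mean: (upper - lower) / (n * epsilon)
        scale = (100 - 0) / (n * epsilon)
        noisy_mean = true_mean + rng.laplace(0.0, scale)
        print(f"n={n:>9,}  eps={epsilon:<4}  true mean={true_mean:6.2f}  "
              f"noisy mean={noisy_mean:6.2f}  noise scale={scale:.5f}")
```

With a million people, even a strict ε of 0.1 barely perturbs the mean, whereas with only a hundred people the same ε makes the estimate noticeably noisier.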
As the UK ICO recently noted, differential privacy may be one of the few techniques capable of providing true data anonymisation (rather than mere pseudonymisation). Yet only a limited number of data science and machine learning practitioners are proficient in applying it.