It means that even when you completely remove addresses, account numbers, and other PII, it is still very easy to re-identify people from such a dataset, and almost all re-identification attacks exploit this. Sensitive information can be compromised even though none of the remaining attributes is unique on its own, because their combination often is: it is well known that 87% of Americans can be uniquely identified just from their gender, birth date, and ZIP code [3].
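To see why the combination matters even though each attribute is common on its own, here is a minimal Python sketch (with made-up records and field names) that counts how many rows are pinned down uniquely by sex, birth date, and ZIP code alone:

```python
from collections import Counter

# Hypothetical toy records: no names or account numbers, only "harmless" attributes.
records = [
    {"sex": "F", "birth_date": "1985-03-12", "zip": "02139"},
    {"sex": "M", "birth_date": "1972-07-31", "zip": "02139"},
    {"sex": "F", "birth_date": "1985-03-12", "zip": "02140"},
    {"sex": "M", "birth_date": "1972-07-31", "zip": "02139"},
]

# Count how many records share each (sex, birth date, ZIP) combination.
combos = Counter((r["sex"], r["birth_date"], r["zip"]) for r in records)

# A record is re-identifiable if its combination is unique in the dataset.
unique_share = sum(
    1 for r in records if combos[(r["sex"], r["birth_date"], r["zip"])] == 1
) / len(records)
print(f"{unique_share:.0%} of records are unique on sex + birth date + ZIP")
```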
The Flaws in Common Anonymisation Techniques
To prevent such attacks, the commonly used method is to group and coarsen the identifying attributes, reporting only age brackets, only the first three digits of ZIP codes, and so on, turning them into coarser quasi-identifiers. This is done in such a way as to guarantee k-anonymity.
As a result, for any record, there are at least k-1 other records sharing exactly the same quasi-identifier values. k-anonymity is a widely adopted method for anonymising data. However, it can often fail to protect sensitive information.
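As a rough illustration, the k-anonymity property of a release can be checked by grouping records on their quasi-identifiers and looking at the smallest group. The following sketch (with invented column names and data) does exactly that:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    is shared by at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) >= k

# Hypothetical coarsened release: age brackets and truncated ZIP codes.
release = [
    {"age": "30-39", "zip3": "021", "condition": "flu"},
    {"age": "30-39", "zip3": "021", "condition": "diabetes"},
    {"age": "40-49", "zip3": "021", "condition": "flu"},
    {"age": "40-49", "zip3": "021", "condition": "heart disease"},
]

print(is_k_anonymous(release, ["age", "zip3"], k=2))  # True: every group has >= 2 rows
```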
A straightforward example of that is the so-called homogeneity attack. Given a dataset of medical conditions (clearly very sensitive information) for individuals whose age, ZIP codes, and other identifiers have been coarsened so as to ensure k-anonymity, it may still be possible to recover the sensitive information [4]. The reason is simple: all k individuals sharing a given set of quasi-identifier values may happen to have the same medical condition.
Hence, if a neighbour knows your age, ZIP code, and gender, it may well be that you fall into a group in which all the other k-1 individuals have the same condition as you. The situation arises whenever the sensitive attribute is not very diverse within a group, and this lack of diversity severely undermines k-anonymity. The effect becomes even more pronounced for high-dimensional data with a large number of quasi-identifiers, where even ensuring k-anonymity in the first place becomes harder [5].
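The weakness can be made visible by counting how many distinct sensitive values each quasi-identifier group actually contains; this is essentially the idea behind the l-diversity refinement of k-anonymity. A small sketch, again with made-up data:

```python
from collections import defaultdict

def sensitive_diversity(records, quasi_identifiers, sensitive):
    """Number of distinct sensitive values inside each quasi-identifier group."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return {group: len(values) for group, values in groups.items()}

# A 2-anonymous release in which one group is completely homogeneous.
release = [
    {"age": "30-39", "zip3": "021", "condition": "cancer"},
    {"age": "30-39", "zip3": "021", "condition": "cancer"},
    {"age": "40-49", "zip3": "021", "condition": "flu"},
    {"age": "40-49", "zip3": "021", "condition": "heart disease"},
]

for group, distinct in sensitive_diversity(release, ["age", "zip3"], "condition").items():
    print(group, "-> distinct conditions:", distinct)
# ('30-39', '021') -> distinct conditions: 1
# i.e. anyone known to fall into that group is known to have cancer.
```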
The lesson from this is that inference attacks often succeed even when only a few coarse-grained data points are revealed.
Linkage Attacks: Connecting Information from Different Sources
The information disclosed by one dataset is rarely all the information publicly available about an individual. This may seem obvious, but it enables highly non-trivial attacks: when the dataset is combined with other data sources, the risk of privacy breaches increases sharply. Such background information might not even be sensitive in itself.
For example, the background knowledge that a particular medical condition is much more prevalent in a given age group or sex increases the probability of identifying the conditions of individuals in our previous example. Exploiting such side information about individuals can lead to spectacular attacks.
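A back-of-the-envelope calculation shows the effect. Suppose the released, k-anonymous group leaves two possible conditions for the target, and the attacker also knows the (public, non-sensitive) prevalence of each condition in the target's demographic. Treating the two as independent pieces of evidence and combining them with Bayes' rule, with entirely invented numbers:

```python
# All numbers below are invented, purely for illustration.
# Within the target's quasi-identifier group, the released data leaves
# two possible conditions with these proportions:
group_share = {"heart disease": 0.4, "flu": 0.6}

# Non-sensitive background knowledge: assumed relative prevalence of the
# two conditions in the target's age group and sex.
background = {"heart disease": 0.75, "flu": 0.25}

# Combine the two with Bayes' rule (treating them as independent evidence).
unnormalised = {c: group_share[c] * background[c] for c in group_share}
total = sum(unnormalised.values())
posterior = {c: round(p / total, 2) for c, p in unnormalised.items()}

print(posterior)  # {'heart disease': 0.67, 'flu': 0.33}: the guess sharpens from 40% to 67%
```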
The Latanya Sweeney Case
Arguably, one of the most famous is the attack performed by Latanya Sweeney in 1997. A couple of years earlier, the Massachusetts Group Insurance Commission (GIC) had shared with researchers, and sold to industry, medical data that included performed medical procedures, prescribed medications, and ethnicity, but also people's gender, date of birth, and ZIP code.
Governor Bill Weld assured the public that the data had been fully anonymised. Sweeney paid $20 for the Cambridge, Massachusetts voter registration list, which also contained those three attributes. By cross-referencing the two databases, she identified Weld's entry in the GIC data and, with it, his medical records.
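The mechanism of the attack is nothing more than a database join on the shared attributes. A minimal sketch with entirely fictitious rows (the names, dates, and diagnoses below are invented, not Sweeney's actual data):

```python
# "Anonymised" medical release: no names, but quasi-identifiers retained.
medical = [
    {"sex": "M", "dob": "1945-07-31", "zip": "02138", "diagnosis": "condition A"},
    {"sex": "F", "dob": "1961-02-14", "zip": "02139", "diagnosis": "condition B"},
]

# Public voter registration list: names alongside the same three attributes.
voters = [
    {"name": "A. Smith", "sex": "M", "dob": "1945-07-31", "zip": "02138"},
    {"name": "B. Jones", "sex": "F", "dob": "1983-09-02", "zip": "02139"},
]

# Linkage attack: join the two tables on (sex, date of birth, ZIP code).
key = lambda r: (r["sex"], r["dob"], r["zip"])
voter_index = {key(v): v["name"] for v in voters}

for record in medical:
    name = voter_index.get(key(record))
    if name:
        print(f"{name} -> {record['diagnosis']}")  # A. Smith -> condition A
```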
The German Browsing History Incident
Another example comes from journalist Svea Eckert and data scientist Andreas Dewes. They set up a fake AI start-up, pretended to need data for training their ML models, and obtained, free of charge from a data broker, a database of browsing history for 3 million German users containing some 9 billion URLs with associated timestamps.
Even though no other identifiers were available, they still managed to re-identify the browsing histories of politicians, judges, and even their work colleagues. One way they achieved this was by noticing that a Twitter user who visits Twitter's analytics page leaves a trace of their username in the corresponding URL. By going to the matching Twitter profiles, Eckert and Dewes could identify such individuals.
Interestingly, they also uncovered a police force's undercover operation. The information was exposed in Google Translate URLs, which contain the full text of whatever was typed into the translator.
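Both findings rest on the same observation: URLs themselves can carry identifying content. The sketch below scans a hypothetical history for two such patterns; the exact URL structures are assumptions made for illustration, not reproductions of the ones Eckert and Dewes encountered:

```python
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical browsing-history entries; the URL structures are assumed.
history = [
    "https://analytics.twitter.com/user/some_politician/home",
    "https://translate.google.com/?sl=de&tl=en&text=internal%20operation%20notes",
    "https://example.com/news/article-42",
]

for url in history:
    parsed = urlparse(url)
    # Twitter analytics pages embed the account's username in the path.
    m = re.match(r"/user/([^/]+)/", parsed.path)
    if parsed.netloc == "analytics.twitter.com" and m:
        print("Twitter handle leaked:", m.group(1))
    # Google Translate URLs can carry the typed text as a query parameter.
    if parsed.netloc == "translate.google.com":
        text = parse_qs(parsed.query).get("text", [])
        if text:
            print("Translated text leaked:", text[0])
```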
Tracing Personal Preferences in Anonymised Netflix Data
Even seemingly non-sensitive data has the potential to reveal a lot about us. Netflix learned this the hard way when it released a database of movie ratings made by its users for the Netflix Prize competition.
Netflix stripped all the PII from the data but, as you probably know by now, it was still possible to identify some of the users. This was done by researchers from the University of Texas at Austin, who linked Netflix's dataset to public ratings on IMDb [6]. In this way, information about people's political preferences and even their sexual orientation was compromised.
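Conceptually, the linkage again boils down to matching records across two sources, this time on overlapping movie ratings rather than demographic attributes. The sketch below is a toy scoring scheme with invented data; the actual algorithm in [6] is more careful, for instance weighting rare movies and allowing for approximate rating dates:

```python
# A public IMDb-style profile, posted under a real name (ratings invented).
public_imdb_profile = {"Movie A": 5, "Movie B": 4, "Movie C": 1}

# "Anonymised" Netflix-style records (also invented).
anonymous_records = {
    "user_17": {"Movie A": 5, "Movie B": 4, "Movie C": 1, "Movie D": 3},
    "user_42": {"Movie A": 2, "Movie E": 5},
}

def overlap_score(profile, record, tolerance=1):
    """Number of movies rated similarly (within `tolerance` stars) in both sets."""
    return sum(1 for movie, rating in profile.items()
               if movie in record and abs(record[movie] - rating) <= tolerance)

best = max(anonymous_records,
           key=lambda u: overlap_score(public_imdb_profile, anonymous_records[u]))
print("Best match:", best)  # user_17: the public profile likely belongs to this record
```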