It means that even when you completely remove addresses, account numbers, and other PII, it is still very easy to re-identify people from such a dataset, and almost all re-identification attacks exploit this. Sensitive information can be compromised even though none of the remaining attributes is unique on its own, because their combination often is: it is well known that 87% of Americans can be uniquely identified just from their gender, birth date, and ZIP code [3].
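To see why the combination matters even though each attribute is common on its own, here is a minimal Python sketch (with made-up records and field names) that counts how many rows are pinned down uniquely by sex, birth date, and ZIP code alone:

```python
from collections import Counter

# Hypothetical toy records: no names or account numbers, only "harmless" attributes.
records = [
    {"sex": "F", "birth_date": "1985-03-12", "zip": "02139"},
    {"sex": "M", "birth_date": "1972-07-31", "zip": "02139"},
    {"sex": "F", "birth_date": "1985-03-12", "zip": "02140"},
    {"sex": "M", "birth_date": "1972-07-31", "zip": "02139"},
]

# Count how many records share each (sex, birth date, ZIP) combination.
combos = Counter((r["sex"], r["birth_date"], r["zip"]) for r in records)

# A record is re-identifiable if its combination is unique in the dataset.
unique_share = sum(
    1 for r in records if combos[(r["sex"], r["birth_date"], r["zip"])] == 1
) / len(records)
print(f"{unique_share:.0%} of records are unique on sex + birth date + ZIP")
```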
The Flaws in Common Anonymisation Techniques
To prevent such attacks, the commonly used method is to group and coarsen the identifying attributes, reporting only age brackets, only the first three digits of ZIP codes, and so on, turning them into coarser quasi-identifiers. This is done in such a way as to guarantee k-anonymity.
As a result, for any record, there are at least k-1 other records sharing exactly the same quasi-identifier values. k-anonymity is a widely adopted method for anonymising data. However, it can often fail to protect sensitive information.
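As a rough illustration, the k-anonymity property of a release can be checked by grouping records on their quasi-identifiers and looking at the smallest group. The following sketch (with invented column names and data) does exactly that:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    is shared by at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) >= k

# Hypothetical coarsened release: age brackets and truncated ZIP codes.
release = [
    {"age": "30-39", "zip3": "021", "condition": "flu"},
    {"age": "30-39", "zip3": "021", "condition": "diabetes"},
    {"age": "40-49", "zip3": "021", "condition": "flu"},
    {"age": "40-49", "zip3": "021", "condition": "heart disease"},
]

print(is_k_anonymous(release, ["age", "zip3"], k=2))  # True: every group has >= 2 rows
```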
A straightforward example of that is the so-called homogeneity attack. Given a dataset of medical conditions (clearly very sensitive information) for individuals whose age, ZIP codes, and other identifiers have been coarsened so as to ensure k-anonymity, it may still be possible to recover the sensitive information [4]. The reason is simple: all k individuals sharing a given set of quasi-identifier values may happen to have the same medical condition.
Hence, if a neighbour knows your age, ZIP code, and gender, it may well be that you fall into a group in which all the other k-1 individuals have the same condition as you. The situation arises whenever the sensitive attribute is not very diverse within a group, and this lack of diversity severely undermines k-anonymity. The effect becomes even more pronounced for high-dimensional data with a large number of quasi-identifiers, where even ensuring k-anonymity in the first place becomes harder [5].
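The weakness can be made visible by counting how many distinct sensitive values each quasi-identifier group actually contains; this is essentially the idea behind the l-diversity refinement of k-anonymity. A small sketch, again with made-up data:

```python
from collections import defaultdict

def sensitive_diversity(records, quasi_identifiers, sensitive):
    """Number of distinct sensitive values inside each quasi-identifier group."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return {group: len(values) for group, values in groups.items()}

# A 2-anonymous release in which one group is completely homogeneous.
release = [
    {"age": "30-39", "zip3": "021", "condition": "cancer"},
    {"age": "30-39", "zip3": "021", "condition": "cancer"},
    {"age": "40-49", "zip3": "021", "condition": "flu"},
    {"age": "40-49", "zip3": "021", "condition": "heart disease"},
]

for group, distinct in sensitive_diversity(release, ["age", "zip3"], "condition").items():
    print(group, "-> distinct conditions:", distinct)
# ('30-39', '021') -> distinct conditions: 1
# i.e. anyone known to fall into that group is known to have cancer.
```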
The lesson from this is that inference attacks often succeed even when only a few coarse-grained data points are revealed.
Linkage Attacks: Connecting Information from Different Sources
The information disclosed by one dataset is rarely all the information publicly available about an individual. This may seem obvious, but it enables highly non-trivial attacks: when the dataset is combined with other data sources, the risk of privacy breaches increases sharply. Such background information might not even be sensitive in itself.
For example, the background knowledge that a particular medical condition is much more prevalent in a given age group or sex increases the probability of identifying the conditions of individuals in our previous example. Exploiting such side information about individuals can lead to spectacular attacks.
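A back-of-the-envelope calculation shows the effect. Suppose the released, k-anonymous group leaves two possible conditions for the target, and the attacker also knows the (public, non-sensitive) prevalence of each condition in the target's demographic. Treating the two as independent pieces of evidence and combining them with Bayes' rule, with entirely invented numbers:

```python
# All numbers below are invented, purely for illustration.
# Within the target's quasi-identifier group, the released data leaves
# two possible conditions with these proportions:
group_share = {"heart disease": 0.4, "flu": 0.6}

# Non-sensitive background knowledge: assumed relative prevalence of the
# two conditions in the target's age group and sex.
background = {"heart disease": 0.75, "flu": 0.25}

# Combine the two with Bayes' rule (treating them as independent evidence).
unnormalised = {c: group_share[c] * background[c] for c in group_share}
total = sum(unnormalised.values())
posterior = {c: round(p / total, 2) for c, p in unnormalised.items()}

print(posterior)  # {'heart disease': 0.67, 'flu': 0.33}: the guess sharpens from 40% to 67%
```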
The Latanya Sweeney Case
Arguably, one of the most famous is the attack performed by Latanya Sweeney in 1997. A couple of years earlier, the Massachusetts Group Insurance Commission (GIC) had shared with researchers, and sold to industry, medical data that included performed medical procedures, prescribed medications, and ethnicity, but also people's gender, date of birth, and ZIP code.
Governor Bill Weld assured the public that the data had been fully anonymised. Sweeney paid $20 for the Cambridge, Massachusetts voter registration list, which also contained those three attributes. By cross-referencing the two databases, she identified Weld's entry in the GIC data and, with it, his medical records.
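The mechanism of the attack is nothing more than a database join on the shared attributes. A minimal sketch with entirely fictitious rows (the names, dates, and diagnoses below are invented, not Sweeney's actual data):

```python
# "Anonymised" medical release: no names, but quasi-identifiers retained.
medical = [
    {"sex": "M", "dob": "1945-07-31", "zip": "02138", "diagnosis": "condition A"},
    {"sex": "F", "dob": "1961-02-14", "zip": "02139", "diagnosis": "condition B"},
]

# Public voter registration list: names alongside the same three attributes.
voters = [
    {"name": "A. Smith", "sex": "M", "dob": "1945-07-31", "zip": "02138"},
    {"name": "B. Jones", "sex": "F", "dob": "1983-09-02", "zip": "02139"},
]

# Linkage attack: join the two tables on (sex, date of birth, ZIP code).
key = lambda r: (r["sex"], r["dob"], r["zip"])
voter_index = {key(v): v["name"] for v in voters}

for record in medical:
    name = voter_index.get(key(record))
    if name:
        print(f"{name} -> {record['diagnosis']}")  # A. Smith -> condition A
```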
The German Browsing History Incident
Another example comes from journalist Svea Eckert and data scientist Andreas Dewes. They set up a fake AI start-up, pretended to need data for training their ML models, and obtained, free of charge from a data broker, a database of browsing history for 3 million German users containing some 9 billion URLs with associated timestamps.
Even though no other identifiers were available, they still managed to re-identify the browsing histories of politicians, judges, and even their work colleagues. One way they achieved this was by noticing that a Twitter user who visits Twitter's analytics page leaves a trace of their username in the corresponding URL. By going to the matching Twitter profiles, Eckert and Dewes could identify such individuals.
Interestingly, they also uncovered a police force's undercover operation. The information was exposed in Google Translate URLs, which contain the full text of whatever was typed into the translator.
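Both findings rest on the same observation: URLs themselves can carry identifying content. The sketch below scans a hypothetical history for two such patterns; the exact URL structures are assumptions made for illustration, not reproductions of the ones Eckert and Dewes encountered:

```python
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical browsing-history entries; the URL structures are assumed.
history = [
    "https://analytics.twitter.com/user/some_politician/home",
    "https://translate.google.com/?sl=de&tl=en&text=internal%20operation%20notes",
    "https://example.com/news/article-42",
]

for url in history:
    parsed = urlparse(url)
    # Twitter analytics pages embed the account's username in the path.
    m = re.match(r"/user/([^/]+)/", parsed.path)
    if parsed.netloc == "analytics.twitter.com" and m:
        print("Twitter handle leaked:", m.group(1))
    # Google Translate URLs can carry the typed text as a query parameter.
    if parsed.netloc == "translate.google.com":
        text = parse_qs(parsed.query).get("text", [])
        if text:
            print("Translated text leaked:", text[0])
```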
Tracing Personal Preferences in Anonymised Netflix Data
Even seemingly non-sensitive data has the potential to reveal a lot about us. Netflix learned this the hard way when it released a database of movie ratings made by its users for the Netflix Prize competition.
Netflix stripped all the PII from the data but, as you probably know by now, it was still possible to identify some of the users. This was done by researchers from the University of Texas at Austin, who linked Netflix's dataset to public ratings on IMDb [6]. In this way, information about people's political preferences and even their sexual orientation was compromised.
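Conceptually, the linkage again boils down to matching records across two sources, this time on overlapping movie ratings rather than demographic attributes. The sketch below is a toy scoring scheme with invented data; the actual algorithm in [6] is more careful, for instance weighting rare movies and allowing for approximate rating dates:

```python
# A public IMDb-style profile, posted under a real name (ratings invented).
public_imdb_profile = {"Movie A": 5, "Movie B": 4, "Movie C": 1}

# "Anonymised" Netflix-style records (also invented).
anonymous_records = {
    "user_17": {"Movie A": 5, "Movie B": 4, "Movie C": 1, "Movie D": 3},
    "user_42": {"Movie A": 2, "Movie E": 5},
}

def overlap_score(profile, record, tolerance=1):
    """Number of movies rated similarly (within `tolerance` stars) in both sets."""
    return sum(1 for movie, rating in profile.items()
               if movie in record and abs(record[movie] - rating) <= tolerance)

best = max(anonymous_records,
           key=lambda u: overlap_score(public_imdb_profile, anonymous_records[u]))
print("Best match:", best)  # user_17: the public profile likely belongs to this record
```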