You are not truly "anonymous": how do you tell anonymous data from de-identified data?


Full text: 2,715 words. Estimated reading time: 7 minutes.


Anonymization protects the privacy of data, and companies use it to safeguard sensitive data. Such data includes:

 

· Personal data

· Business information, such as financial information or trade secrets

· Confidential information, such as military secrets or government information

 

Anonymization is a prime example of compliance with privacy regulations that govern personal data. Customer information sits where personal data and business data overlap, but not all business data is regulated. This article focuses on the protection of personal data.

 

Figure: Examples of sensitive data types

 

In Europe, regulators define "personal data" as any information relating to an identified or identifiable person, such as your name; any information about that person meets the definition, regardless of its form. Over the last century, collecting personal data has become steadily more democratized, and the problem of data anonymization has grown with it. With privacy regulations coming into force around the world, the issue matters more than ever.

 

What is data anonymization and why should I care about it?

 

We start with the classic definition. The EU's General Data Protection Regulation (GDPR) defines anonymous information as "information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable."

  

The words "identifiable" and "no longer" are crucial. Not only must your name be absent from the data; it must also be impossible to work out who you are from what remains. That is the province of re-identification (sometimes called de-anonymization).

 

The GDPR (in the same recital) states an important consequence: "... the principles of data protection should therefore not apply to anonymous information." In other words, if you truly anonymize your data, it is no longer subject to the GDPR's data protection rules.

 

You can then perform any processing operation on it, such as analysis or data monetization. This opens up a lot of opportunities:

 

· Selling data is the obvious one. Privacy laws around the world are restricting transactions in personal data, and anonymous data gives companies another option.

 

· It creates opportunities for collaboration. Many companies share data for innovation or research, and anonymous data helps reduce the risk of doing so.

 

· It also creates opportunities for data analysis and machine learning. Operating on sensitive data while staying compliant is becoming more and more complex, and anonymous data provides safe raw material for statistical analysis and model training. The prospects are bright. In practice, though, supposedly anonymous data is often not what one would wish.

 

The spectrum of data privacy mechanisms

 

Data privacy protection exists on a spectrum. Over the years, experts have developed a range of techniques combining methods, mechanisms, and tools. These techniques produce data with different degrees of anonymity and, correspondingly, different levels of re-identification risk. The spectrum runs from directly identifiable personal data all the way to truly anonymous data.

 

Figure: The spectrum of data privacy

 

At one end of the spectrum sits data containing direct personal identifiers, elements such as a name, address, or phone number that point straight to a person. At the other end is the anonymous data the GDPR refers to.

 

In between lies an intermediate category: pseudonymous data and de-identified data. Note that the definitions here are still contested; some reports treat pseudonymization as part of de-identification, while others exclude it.

 

The techniques that generate this "intermediate data" are not inherently problematic. They are effective tools for data minimization and, depending on the use case, perfectly appropriate and useful. But remember: they do not produce truly anonymous data, and their mechanisms cannot guarantee protection against re-identification, so calling their output "anonymous data" is misleading.

 

Anonymous versus "anonymized"

 

Pseudonymization and de-identification do protect data privacy in certain ways. But by the GDPR's definition, they do not produce anonymous data.

 


Pseudonymization removes or replaces the direct personal identifiers in the data. If you delete all names and email addresses from a data set, for example, you can no longer identify anyone directly from the pseudonymous data, but you can still do so indirectly: the remaining data usually retains indirect identifiers such as date of birth, zip code, and gender, and in combination these can work like a direct identifier.
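To make this concrete, here is a minimal Python sketch of pseudonymization. The records, field names, and salted-hash scheme are illustrative assumptions, not something from the original article: direct identifiers are dropped or replaced by an opaque token, while the indirect identifiers pass through untouched.

```python
import hashlib

# Toy records; every field and value here is invented for illustration.
records = [
    {"name": "Alice Smith", "email": "alice@example.com",
     "birth_date": "1984-03-12", "zip": "02139", "gender": "F"},
    {"name": "Bob Jones", "email": "bob@example.com",
     "birth_date": "1975-11-02", "zip": "94103", "gender": "M"},
]

SALT = b"keep-this-secret"  # the "additional information" needed to reverse the mapping

def pseudonymize(record):
    """Drop direct identifiers and replace them with an opaque salted token."""
    token = hashlib.sha256(SALT + record["email"].encode()).hexdigest()[:12]
    out = {k: v for k, v in record.items() if k not in ("name", "email")}
    out["pseudonym"] = token
    return out

for row in map(pseudonymize, records):
    print(row)

# Note what survives: birth_date, zip, and gender -- exactly the indirect
# identifiers that can be combined to re-identify someone.
```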

 

Pseudonymization accordingly has its own definition in the GDPR: "... the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information." In contrast to anonymous data, pseudonymous data remains subject to the GDPR.

 

De-identification removes both the direct and the indirect personal identifiers from the data. In theory, the line between de-identified data and anonymous data is simple; in practice, no technique is known that guarantees data can never be re-identified. It is an "innocent until proven guilty" situation: de-identified data counts as anonymous only until someone re-identifies it, and every time experts re-identify data that was supposedly de-identified, they push the line further.
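De-identification can be sketched the same way. A common technique is generalization: coarsen the indirect identifiers until many records share each combination of values. The bucketing rules below (decade of birth, 3-digit ZIP prefix) are invented for illustration, not a standard:

```python
def deidentify(record):
    """Generalize indirect identifiers so records are no longer unique."""
    year = int(record["birth_date"][:4])
    return {
        "birth_decade": f"{year - year % 10}s",  # '1984-03-12' -> '1980s'
        "zip_prefix": record["zip"][:3],         # '02139' -> '021'
        "gender": record["gender"],
    }

print(deidentify({"birth_date": "1984-03-12", "zip": "02139", "gender": "F"}))
# -> {'birth_decade': '1980s', 'zip_prefix': '021', 'gender': 'F'}

# Coarser buckets lower the re-identification risk but also the data's
# utility -- and no bucketing rule guarantees anonymity forever.
```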

 

Data re-identification continues to redefine anonymity

 

These mechanisms are not equally effective at protecting privacy, so how the resulting data is handled matters a great deal. Companies regularly publish or sell data they describe as "anonymous," and when the methods used cannot actually guarantee anonymity, that claim carries hidden dangers.

 

Numerous incidents show that the privacy protections of pseudonymized data are flawed. Indirect identifiers left in the data create a huge risk of re-identification, and as the amount of available data grows, so do the opportunities for cross-referencing data sets (a sketch of such a linkage attack follows this list):

 

· In the 1990s, an MIT graduate student re-identified the governor of Massachusetts in de-identified medical data. She cross-referenced the records with public census data to determine which ones were the governor's.

 

· In 2006, AOL shared de-identified search data as part of a research project, allowing researchers to tie search queries back to the individuals behind them.

 

· In 2009, Netflix released an "anonymous" movie-rating data set as part of a competition, and researchers in Texas successfully re-identified users in it.

 

· Also in 2009, researchers were able to predict a person's Social Security number using only public information.
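The common thread in these incidents is the linkage attack: joining a "de-identified" data set with a public one on the indirect identifiers they share. A minimal sketch, with both data sets and every value invented for illustration:

```python
# Hypothetical "de-identified" medical records: names removed,
# indirect identifiers intact.
medical = [
    {"birth_date": "1945-07-31", "zip": "02138", "gender": "M",
     "diagnosis": "hypertension"},
]

# Hypothetical public records: names alongside the same indirect identifiers.
public = [
    {"name": "J. Doe", "birth_date": "1945-07-31", "zip": "02138", "gender": "M"},
]

QUASI_IDENTIFIERS = ("birth_date", "zip", "gender")

for m in medical:
    matches = [p for p in public
               if all(p[k] == m[k] for k in QUASI_IDENTIFIERS)]
    if len(matches) == 1:  # a unique join re-identifies the record
        print(f"Re-identified {matches[0]['name']}: {m['diagnosis']}")
```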

 

Recent studies show that de-identified data can indeed be re-identified. Researchers from UCLouvain in Louvain-la-Neuve, Belgium, and Imperial College London found that "99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes."

 

Another study, of "anonymous" mobile phone data, showed that "four spatio-temporal points are enough to uniquely identify 95% of the individuals."
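Findings like these can be reproduced in miniature by measuring how many individuals are pinned down by k randomly chosen points from their own trace. Below is a toy version of that measurement on synthetic traces; the population size, grid, and trace length are all invented:

```python
import random

random.seed(0)

# Synthetic traces: 1,000 users, each visiting 20 (cell, hour) points
# drawn from 50 location cells and 30 time slots.
users = {
    u: {(random.randrange(50), random.randrange(30)) for _ in range(20)}
    for u in range(1000)
}

def unique_fraction(k):
    """Fraction of users whose k random points match no other user's trace."""
    hits = 0
    for trace in users.values():
        points = set(random.sample(sorted(trace), k))
        # The user's own trace always matches, so a count of 1 means unique.
        if sum(points <= other for other in users.values()) == 1:
            hits += 1
    return hits / len(users)

for k in (2, 4):
    print(f"{k} spatiotemporal points uniquely identify "
          f"{unique_fraction(k):.0%} of users")
```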

 

Technology advances by the day, ever more data is created, and researchers keep redrawing the line between de-identified data and anonymous data. In 2017, researchers published a paper showing that web browsing histories can be linked to social media profiles using only publicly available data.

 

Another worrying issue is the leakage of personal information, which keeps growing. The ForgeRock Consumer Identity Breach Report projected that data breaches in 2020 would exceed the previous year's; in the United States alone, more than 1.6 billion consumer records were exposed in the first quarter of 2020.

 

A data set that cannot be re-identified on its own can pose a far greater threat once combined with breached data. Students at Harvard University showed they could re-identify de-identified data using breached data sets.

 


In short, what we call "anonymous data" is often not truly anonymous. Not every data-sanitization method produces truly anonymous data; each has its advantages, but none provides the level of privacy that anonymity does. As the volume of data keeps growing, creating truly anonymous data becomes ever harder, and the risk that companies publish re-identifiable personal data keeps rising.

 



Translation team: Hao Yanjun, Zhu Yi

Original article:

https://www.kdnuggets.com/2020/08/anonymous-anonymized-data.html
