Paper reference: Research on privacy protection technology for data release

Contribution

This article focuses on privacy protection in dynamic data stream publishing and in multi-user collaborative data publishing.

In non-interactive data publishing, the data owner does not know in advance what queries data analysts will run on the anonymized data set, so privacy protection algorithms must provide both strong privacy and high data utility.

The current research can be divided into three categories:

  • Research on privacy rules (models), such as k-anonymity and differential privacy.
  • Research on algorithms that publish anonymized data under an existing privacy rule, mainly on how to optimize the algorithm to improve data utility: for example, how to better allocate the privacy budget in the differential privacy model. (This article belongs to this category; a budget-allocation sketch follows this list.)
  • Research on how to measure the utility of anonymized data sets more accurately.
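
To make the privacy-budget idea concrete, here is a minimal, hypothetical sketch (not from the paper) of the standard Laplace mechanism answering two count queries from one total budget. By sequential composition the per-query epsilons must sum to the total, so the split decides which answers come back more accurately.

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Answer a count query under epsilon-differential privacy.
    A count query has sensitivity 1 (one record changes the count
    by at most 1), so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(1 for row in data if predicate(row))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical records, invented for illustration.
records = [{"age": 34, "flu": 1}, {"age": 52, "flu": 0}, {"age": 47, "flu": 1}]

# Sequential composition: the epsilons spent on individual queries add
# up, so a total budget of 1.0 split evenly gives each query 0.5.
total_epsilon = 1.0
noisy_flu = laplace_count(records, lambda r: r["flu"] == 1, total_epsilon / 2)
noisy_old = laplace_count(records, lambda r: r["age"] > 40, total_epsilon / 2)
print(noisy_flu, noisy_old)
```

An uneven split (say 0.8/0.2) would make the first answer more accurate at the cost of the second; budget-allocation research studies how to make such trade-offs well.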

Background knowledge of the attacker

  • External data: (1) publicly available data; (2) linked data, such as information about people related to the target individual. Using this additional information, the attacker can infer whether the target individual's tuple is present in the anonymized data set and then discover sensitive information.
  • Common knowledge: additional information about the distribution of the target individual's sensitive values, obtainable from many sources. For example, the adversary may hear from a colleague that another colleague's salary exceeds 10k; or know that colds are common in winter and that the first stop at a hospital is usually the registration desk.
  • Knowledge of the privacy protection algorithm: the attacker may know the mechanism of the anonymization algorithm used; in some cases, the algorithm itself may disclose sensitive information.

The same patient may visit multiple hospitals and leave medical records at each. When the same individual appears in different data sets, the compromise of one data set may therefore leak privacy in the others. In addition, a staff member of one hospital may have worked at several hospitals; if such a person carries out the aggregation, privacy may be exposed.

Corresponding solution: each hospital anonymizes its own data independently and then aggregates it with the other hospitals' data; or a secure collaborative processing strategy is used to aggregate the data safely first and then anonymize it. A toy sketch of the first strategy appears below.
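
Purely to illustrate the first strategy (anonymize locally, then aggregate), here is a hypothetical sketch; the hospital names and records are invented. Each hospital generalizes its quasi-identifiers before any data leaves its premises, so the aggregator only ever sees anonymized records.

```python
def generalize(record):
    """Local anonymization: exact age becomes a 10-year interval,
    and the ZIP code is truncated to its first three digits."""
    lo = (record["age"] // 10) * 10
    return {
        "age": f"{lo}-{lo + 9}",
        "zip": record["zip"][:3] + "**",
        "disease": record["disease"],
    }

# Invented per-hospital data sets.
hospital_a = [{"age": 34, "zip": "10001", "disease": "flu"}]
hospital_b = [{"age": 52, "zip": "94105", "disease": "hiv"}]

# The aggregation step only combines already-anonymized records.
published = [generalize(r) for r in hospital_a + hospital_b]
print(published)
```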

If the adversary has sufficient background knowledge, even numerical statistics over very few attributes can leak an individual's privacy.

For example, suppose the hospital provides a query service for data analysts: encode having a certain disease as 1 and not having it as 0, and allow queries for the number of patients among the first i rows of the data set. The difference between the answers for the first i rows and the first i-1 rows then reveals the disease status of the individual in the i-th row. If the attacker has the background information that "Rose is located in the i-th row of the data set", her disease information is leaked. The sketch below illustrates this differencing attack.
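
A minimal, hypothetical illustration of the differencing attack described above (the data and names are invented):

```python
# 1 = has the disease, 0 = does not (invented data).
names = ["Alice", "Bob", "Carol", "Rose", "Eve"]
disease = [0, 1, 0, 1, 1]

def prefix_count(i):
    """Query service: number of patients among the first i rows."""
    return sum(disease[:i])

# The attacker knows Rose is in row i (here, row 4).
i = 4
rose_status = prefix_count(i) - prefix_count(i - 1)
print(f"{names[i - 1]} has the disease: {bool(rose_status)}")
```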

Types of data sets

  • Static data set
  • Dynamic data set
  • Collaborative data set

The most commonly used basic methods of privacy protection technology

  • Generalization: replace specific values in tuples with intervals (or more general categories) that contain them. Because generalization preserves semantic meaning, it loses less information than perturbation.
  • Perturbation: according to a certain probability distribution, replace the value of an attribute with another value from the same attribute domain. Because perturbed values are no longer the true values, too much distortion may make analysis results inaccurate. Perturbation is nevertheless useful for numerical statistical queries (such as aggregate queries) because it can preserve the statistical properties of the original data. Moreover, perturbed data sets produced by differential privacy algorithms can achieve the strongest privacy protection. A randomized-response sketch follows this list.
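
To show how perturbation can distort individual values while preserving aggregate statistics, here is a minimal randomized-response sketch (a standard perturbation technique used as an illustration; the paper may use a different mechanism). Each binary value is kept with probability p and flipped otherwise, and a simple unbiased estimator recovers the population rate.

```python
import random

def randomized_response(value, p=0.75):
    """Keep the true binary value with probability p, otherwise
    replace it with the other value from the same domain {0, 1}."""
    return value if random.random() < p else 1 - value

def estimate_true_rate(perturbed, p=0.75):
    """Unbiased estimate of the true rate t of 1s, inverting
    E[observed] = p*t + (1 - p)*(1 - t)."""
    observed = sum(perturbed) / len(perturbed)
    return (observed - (1 - p)) / (2 * p - 1)

random.seed(0)
true_values = [1] * 300 + [0] * 700          # true rate = 0.3
noisy = [randomized_response(v) for v in true_values]
print(estimate_true_rate(noisy))             # close to 0.3
```

No single perturbed record can be trusted, yet the aggregate rate is recoverable, which is exactly the property aggregate queries need.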

Research work on background knowledge modeling for attackers

  • The (c, k)-safety model describes the adversary's background knowledge so as to quantify the attack threat, and proposes a method guaranteeing that an attacker with no more than k pieces of background knowledge cannot infer any individual's true sensitive attribute with probability exceeding c. (A toy check of this condition appears after this list.)
  • The Privacy skyline model [14] uses logical expressions to specify three types of background knowledge that an attacker may have:
    1) knowledge directly related to the target individual;
    2) knowledge indirectly related to the target individual (such as friends or relatives of the target person);
    3) knowledge of individuals with the same sensitive attribute value.
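
The paper gives the exact formulation; purely to make the quantifiers concrete, here is a hypothetical worst-case check for a single anonymized group, under the simplifying assumption that each piece of background knowledge lets the attacker eliminate one tuple from the group.

```python
from collections import Counter

def ck_safe(sensitive_values, c, k):
    """Toy (c, k)-safety check for one anonymized group (simplified:
    each of the k pieces of knowledge eliminates one tuple). An
    attacker maximizes belief in a target value s by eliminating k
    tuples carrying other values, so the worst-case posterior is
    count(s) / (n - k); the group is safe if this never exceeds c."""
    n = len(sensitive_values)
    if n - k <= 0:
        return False  # the knowledge can pin down the whole group
    counts = Counter(sensitive_values)
    worst = max(min(cnt, n - k) / (n - k) for cnt in counts.values())
    return worst <= c

# A diverse group withstands one piece of knowledge...
print(ck_safe(["flu", "cancer", "flu", "hiv"], c=0.7, k=1))  # True
# ...but a skewed group does not.
print(ck_safe(["flu", "flu", "flu", "hiv"], c=0.7, k=1))     # False
```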

Source: blog.csdn.net/weixin_42253964/article/details/107063737