Paper reference: research on background knowledge

Bayes-based privacy leakage analysis in location services

The notion of background knowledge captures the adversary's ability to collect auxiliary information about individuals; an adversary can gather such knowledge from the real world.

Worst-Case Background Knowledge for Privacy-Preserving Data Publishing

When reasoning about privacy in data release, it is necessary to consider the attacker's background knowledge. In practice, however, the data publisher does not know what background knowledge the attacker has, so it is important to consider the worst case. This article initiates a formal study of worst-case background knowledge.

We propose a language that can express any background knowledge about the data. Assuming the attacker knows at most k pieces of information in this language, we give a polynomial-time algorithm to measure the worst-case disclosure of sensitive information. We also give an effective way to sanitize the data so that the worst-case disclosure stays below a specified threshold.

The goals of this article are to (1) quantify the precise impact of the attacker's background knowledge on the amount of disclosure, and (2) provide algorithms to check and ensure that the disclosure stays below a specified threshold.

Generally, l-diversity and k-anonymity protect against particular types of background knowledge that the attacker is assumed to have; if the attacker has background knowledge of other types, the protection breaks down. Therefore, a more general framework is needed, one that can capture any knowledge about attributes of the underlying table that an attacker might possess.

The worst-case disclosure risk caused by an attacker holding k such units of information can then be quantified, so k can be regarded as a bound on the attacker's power.

We will show

  • How to protect privacy effectively by ensuring that the worst-case (i.e., maximum) disclosure over any k pieces of information stays below a specified threshold
  • How to integrate our techniques into existing frameworks to find a "minimally sanitized" table whose maximum disclosure is below a specified threshold

Framework

  • P is a (finite) set of people; every p ∈ P has a tuple t_p.
  • S denotes the sensitive attribute; its domain is the set of possible sensitive values.
  • T is a table, i.e., a set of tuples corresponding to a subset of P.
    The publisher wishes to publish T in a form that protects every individual's sensitive information from the attacker.
  • The attacker has background knowledge expressible in a language L.

Two sanitization methods

  • Bucketization: partition the tuples of T into buckets and separate the sensitive attribute from the non-sensitive attributes by randomly permuting the sensitive values within each bucket. The sanitized data consists of the buckets with permuted sensitive values.
  • Global generalization: coarsen the domains of the non-sensitive attributes. The sanitized data consists of the coarsened table together with the generalization used. Unlike bucketization, the exact values of the non-sensitive attributes are not released; only the coarsened values are.

If the attacker knows the set of people in the table and their non-sensitive values, then global generalization and bucketization are equivalent. In this article, we use bucketization as the method for constructing the published data from the original table T.

Concrete bucketization scheme

Given a table T, we partition its tuples into buckets (i.e., we divide T according to some partitioning scheme), and within each bucket we apply an independent random permutation to the column containing the S-values. The resulting set of buckets, denoted B, is then published.

For example, if the underlying table T is as shown in Figure 1, then the publisher may publish B as shown in Figure 3.
[Figure 1: the underlying table T]
[Figure 3: the published bucketization B]
Of course, to increase privacy further, the publisher can fully suppress the identifying attribute (name), or partially suppress the other non-sensitive attributes (age, gender, postal code).
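
As an illustration of the scheme just described, here is a minimal Python sketch of bucketization. The table layout, column names, and partitioning rule are our own assumptions for the example, not taken from the paper.

```python
import random

def bucketize(table, bucket_of, sensitive_attr):
    """Partition `table` into buckets and independently permute the
    sensitive column within each bucket, as described above."""
    buckets = {}
    for row in table:
        buckets.setdefault(bucket_of(row), []).append(dict(row))
    for rows in buckets.values():
        # Independent random permutation of the S-values in this bucket.
        s_values = [row[sensitive_attr] for row in rows]
        random.shuffle(s_values)
        for row, s in zip(rows, s_values):
            row[sensitive_attr] = s
    return buckets

# Hypothetical data in the spirit of Figures 1 and 3 (not the paper's):
T = [
    {"name": "Ann", "age": 30, "disease": "flu"},
    {"name": "Bob", "age": 35, "disease": "cancer"},
    {"name": "Cal", "age": 60, "disease": "flu"},
    {"name": "Dee", "age": 65, "disease": "diabetes"},
]
B = bucketize(T, lambda row: 0 if row["age"] < 50 else 1, "disease")
# Before release, the publisher could additionally suppress "name".
```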

Background knowledge

We pessimistically assume that the attacker has had enough time to learn which individuals have records in the table. In other words, we assume that for each bucket b ∈ B the attacker knows the set P_b of people whose tuples are in b.

Next, consider what the attacker may know beyond this identifying information. We model this further knowledge as the knowledge that the underlying table satisfies a given predicate. This is a fairly general assumption; for example, "the average age of the heart-disease patients in the table is 48" is such a predicate. To quantify the power of this knowledge, we define a basic unit of knowledge and propose a language composed of finite conjunctions of these basic units. Given full identification information, we want any predicate on the table to be expressible as a conjunction of our basic units. We use a very simple propositional syntax.

Definition 1 (atom):

An atom is a formula of the form t_p[S] = s. For example, t_Jack[disease] = flu says that Jack's tuple takes the value flu on the sensitive attribute disease.

The basic unit of knowledge in the language is the basic implication.

Definition 2 (basic implication)

A basic implication is a formula of the form

⋀_{i∈[m]} A_i → ⋁_{j∈[n]} B_j

for some m ≥ 1, n ≥ 1 and atoms A_i, B_j (note that we use the standard notation [n] for the set {0, ..., n−1}).

Basic implications are a fully expressive "basic unit" of knowledge. The following theorem makes this precise.

Theorem 3 (Completeness)

Given full identification information and any predicate on the table, the knowledge that the underlying table satisfies both the identification information and the given predicate can be expressed as a finite conjunction of basic implications.

Therefore, we can model an arbitrarily powerful attacker. At the same time, it is necessary to assume that the attacker's power is bounded. We model a bounded attacker by limiting the number of basic implications the attacker knows. In other words, the attacker knows a formula from the language L_k^basic defined below.

Definition 4 (the language L_k^basic)

L_k^basic is the language consisting of conjunctions of k basic implications. In other words,

L_k^basic = { ⋀_{i∈[k]} φ_i : each φ_i is a basic implication }
Therefore, k can be regarded as a bound on the attacker's power, and k can be increased for more conservative privacy protection.

Note that our choice of basic implications as the language's "basic units" has an important effect on our assumptions about the attacker's power. In particular, some properties of the underlying table may require many basic implications to express. Since a basic implication is essentially a CNF clause with at least one negated atom, our language suffers an exponential blowup in the number of basic units needed to express an arbitrary DNF formula. Nevertheless, many natural kinds of background knowledge have simple expressions in terms of basic implications.
For example, Alice's knowledge "if Hannah has the flu, then Charlie has the flu" is just the basic implication
t_Hannah[disease] = flu → t_Charlie[disease] = flu

The knowledge "Ed does not have the flu" is
t_Ed[disease] = flu → t_Ed[disease] = ovarian cancer

In general, for s′ ≠ s we can express the negation of t[S] = s as (t[S] = s) → (t[S] = s′), since each tuple takes only one sensitive value.
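
To make the syntax concrete, here is a small Python encoding of atoms and basic implications, including the negation trick above. The class names and representation are ours, purely illustrative.

```python
from typing import NamedTuple, Tuple

class Atom(NamedTuple):
    person: str  # p
    value: str   # s; the atom asserts t_p[S] = s

class BasicImplication(NamedTuple):
    antecedents: Tuple[Atom, ...]  # the A_i, conjoined
    consequents: Tuple[Atom, ...]  # the B_j, disjoined

def implication_holds(imp, table):
    """Check a basic implication against a table (person -> S-value)."""
    if all(table[a.person] == a.value for a in imp.antecedents):
        return any(table[b.person] == b.value for b in imp.consequents)
    return True  # antecedent false, implication vacuously true

def negation(person, value, other_value):
    """Encode "t_person[S] != value" as (t[S] = value) -> (t[S] = other_value)
    for other_value != value: a tuple has only one S-value, so the
    implication can hold only when the antecedent is false."""
    return BasicImplication((Atom(person, value),), (Atom(person, other_value),))

# "If Hannah has the flu, then Charlie has the flu":
alice = BasicImplication((Atom("Hannah", "flu"),), (Atom("Charlie", "flu"),))
# "Ed does not have the flu":
ed = negation("Ed", "flu", "ovarian cancer")
```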

One contribution of this article is a polynomial-time algorithm for computing the maximum disclosure even when the attacker knows such dependencies.

Disclosure

1. The disclosure risk for background knowledge φ is the highest probability the attacker can assign to some individual's sensitive value: max_{p∈P, s∈S} Pr(t_p[S] = s | B ∧ φ)
2. The maximum disclosure of a bucketization B, relative to the language L expressing the background knowledge, is max_{φ∈L} max_{p∈P, s∈S} Pr(t_p[S] = s | B ∧ φ)

To calculate Pr(t_p[S] = s | B ∧ φ), consider the set of all tables consistent with the bucketization B and the background knowledge φ, and take the fraction of those tables that satisfy t_p[S] = s. Using this, the maximum disclosure for the bucketization in Figure 3 comes out to 10/19.
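
The following brute-force sketch makes this definition concrete: enumerate every table consistent with the bucketization, keep those satisfying φ, and take the fraction that assign the value s to person p. It is exponential and only illustrates the semantics; the paper's contribution is a polynomial-time algorithm for this quantity. The data format is our own assumption.

```python
from itertools import permutations, product

def consistent_tables(buckets):
    """Yield every table (person -> S-value) consistent with the bucketization:
    within each bucket the multiset of S-values is fixed, but the values may
    be matched to the bucket's people in any order."""
    per_bucket = []
    for people, s_values in buckets:
        # A set removes duplicate matchings caused by repeated S-values.
        per_bucket.append({tuple(zip(people, perm))
                           for perm in permutations(s_values)})
    for combo in product(*per_bucket):
        yield dict(pair for matching in combo for pair in matching)

def probability(buckets, phi, person, value):
    """Pr(t_person[S] = value | B and phi): the fraction of tables consistent
    with B and phi in which `person` has the given sensitive value.
    Assumes B and phi are jointly consistent (at least one table exists)."""
    worlds = [t for t in consistent_tables(buckets) if phi(t)]
    return sum(t[person] == value for t in worlds) / len(worlds)

# Two buckets of (people, published S-values), plus Alice's implication:
B = [(("Hannah", "Charlie"), ("flu", "cancer")),
     (("Ed", "Frank"), ("flu", "ovarian cancer"))]
phi = lambda t: t["Hannah"] != "flu" or t["Charlie"] == "flu"
print(probability(B, phi, "Charlie", "flu"))  # 1.0 in this toy example
```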

Goals

Our goal is to develop general techniques to:
1. efficiently compute the maximum disclosure of any given bucketization, and
2. efficiently find, among the (finite) set of candidate sanitized bucketizations, a "minimally sanitized" bucketization whose maximum disclosure is below a specified threshold (if one exists).

Check and enforce privacy

Section 3.4 makes the notion of "minimal sanitization" precise; to preserve the utility of the data, we want the sanitization to be minimal.

We now show how to efficiently compute, and bound, the maximum disclosure to an attacker who has full identification information plus up to k additional pieces of background knowledge (i.e., up to k basic implications).

Definition 7 (simple implication): A simple implication is a formula of the form A → B for some atoms A and B.

Note that a conjunction of k simple implications can be written in 2-CNF (conjunctive normal form), for which satisfiability is easy to check.
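
For illustration, here is a standard 2-SAT satisfiability check (Aspvall–Plass–Tarjan, via Kosaraju's strongly connected components). Treating each atom as an independent Boolean variable is our simplification: the paper's actual algorithm also accounts for each person having exactly one sensitive value and for consistency with the buckets.

```python
def two_sat(n, clauses):
    """Variables are 1..n; literal +i means variable i true, -i false.
    Each clause (a, b) means (a OR b). Returns True iff satisfiable."""
    N = 2 * n
    adj = [[] for _ in range(N)]   # implication graph
    radj = [[] for _ in range(N)]  # reversed graph

    def node(lit):
        return 2 * (abs(lit) - 1) + (0 if lit > 0 else 1)

    for a, b in clauses:
        # (a OR b)  ==  (not a -> b) AND (not b -> a)
        adj[node(-a)].append(node(b)); radj[node(b)].append(node(-a))
        adj[node(-b)].append(node(a)); radj[node(a)].append(node(-b))

    # Pass 1: iterative DFS to record finish order.
    order, seen = [], [False] * N
    for s in range(N):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, 0)]
        while stack:
            v, i = stack.pop()
            if i < len(adj[v]):
                stack.append((v, i + 1))
                w = adj[v][i]
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, 0))
            else:
                order.append(v)

    # Pass 2: label SCCs on the reversed graph in reverse finish order.
    comp, c = [-1] * N, 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s], stack = c, [s]
        while stack:
            v = stack.pop()
            for w in radj[v]:
                if comp[w] == -1:
                    comp[w] = c
                    stack.append(w)
        c += 1

    # Unsatisfiable iff some variable shares an SCC with its negation.
    return all(comp[2 * i] != comp[2 * i + 1] for i in range(n))

# Two simple implications, x1 -> x2 and x2 -> not x1, as 2-CNF clauses:
print(two_sat(2, [(-1, 2), (-2, -1)]))  # True (e.g. x1 = False)
```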


How to find a bucketization below the threshold

Given the method for computing the maximum disclosure, we now show how to efficiently find a "minimally sanitized" bucketization whose maximum disclosure is below a given threshold. Intuitively, we want minimal sanitization in order to preserve the utility of the published data.
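
As a hedged sketch (our simplification, not the paper's exact procedure, which integrates the disclosure check into existing sanitization-search frameworks), the search can be viewed as scanning candidates from least to most sanitized:

```python
def find_minimal_bucketization(candidates, max_disclosure, threshold):
    """Return the first candidate bucketization whose worst-case disclosure
    is below the threshold, or None if no candidate qualifies.

    candidates: iterable of bucketizations, ordered least-sanitized first,
                so the first hit is a minimally sanitized solution.
    max_disclosure: function computing worst-case disclosure against
                    L_k^basic (e.g. the paper's polynomial-time algorithm,
                    or a brute-force stand-in like the sketch above).
    """
    for b in candidates:
        if max_disclosure(b) < threshold:
            return b
    return None
```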

Origin blog.csdn.net/weixin_42253964/article/details/107101590