Differential Privacy

The origin of differential privacy

  To protect the privacy of users in a statistical database, the ideal definition of privacy would be this: the database can be queried for statistics without revealing any information about the individuals it contains. In other words, the database should answer aggregate queries, but nothing about any individual should be learnable from those answers.

  However, this ideal definition is not achievable, because it ignores auxiliary information. Take this example: a database of the heights of women in a certain region, from which the average height can be queried. If you also know, from auxiliary information, that Alice is 2 cm taller than the average, then you can compute Alice's height; her height has been leaked even though only an aggregate value was queried.

  We therefore settle for a weaker but practical definition of privacy: the risk of a person's privacy breach should not increase because that person's information is added to a statistical database. This is the idea behind differential privacy.

 

Definition of Differential Privacy

  Given a randomized algorithm $K$, if for any pair of sibling (neighboring) tables $T_1$ and $T_2$, and any set of outputs $S \subseteq Range(K)$, we have:

$Pr[K(T_1) \in S] \leq  e^{\epsilon} \times Pr[K(T_2) \in S]  $

That is: $\frac{Pr[K(T_1) \in S]}{Pr[K(T_2) \in S]} \leq e^{\epsilon}$

then the algorithm $K$ satisfies $\epsilon$-differential privacy. Let me unpack this definition below:

First, $K$ is a randomized algorithm: its output is random, so it is described with the tools of probability, such as the probability density of its output or the probability that the output falls in a given set.

$T_1$ and $T_2$ are sibling data tables, meaning they differ in exactly one record: one table contains a particular user's information and the other does not. This mirrors the definition of privacy above: the risk of a person's privacy breach should not increase just because the person's information is in the database.

We write $S \subseteq Range(K)$ rather than a single output point because, for a continuous output distribution, the probability of hitting any single point is always 0; probabilities are assigned to sets (ranges) of outputs, so the condition says that the output falls within a range.

The definition is therefore stated in terms of probabilities: it guarantees that when $K$ is run on sibling tables, the probabilities of every range of outputs are very close (within a factor of $e^{\epsilon}$).

For example, with the Laplace distributions shown below, the output distributions that the algorithm produces on the two sibling data tables must be very close to each other.

[Figure 1-2: Laplace output distributions on two sibling data tables]
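As a rough illustration of the definition (not from the original post), the following Python sketch evaluates the two Laplace output densities for a hypothetical counting query on two made-up sibling tables and checks that their pointwise ratio never exceeds $e^{\epsilon}$; the table contents and $\epsilon = 1$ are assumptions chosen for the example.

```python
import numpy as np

def laplace_pdf(x, mu, scale):
    """Density of a Laplace distribution centered at mu with the given scale."""
    return np.exp(-np.abs(x - mu) / scale) / (2 * scale)

def count_query(table):
    """Counting query: how many records satisfy the condition (marked 1)."""
    return float(sum(table))

# Hypothetical sibling tables: T2 is T1 with one record removed.
T1 = [1, 0, 1, 1, 0, 1]
T2 = T1[:-1]

epsilon = 1.0
scale = 1.0 / epsilon                 # Laplace scale = S(f) / epsilon, with S(f) = 1

xs = np.linspace(-5.0, 10.0, 1001)
ratio = laplace_pdf(xs, count_query(T1), scale) / laplace_pdf(xs, count_query(T2), scale)

# The pointwise ratio of the two output densities never exceeds e^epsilon.
print(ratio.max(), np.exp(epsilon))   # both are approximately e = 2.718...
```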

Sensitivity

  Sensitivity is a measure of a function. For a function $f: D \rightarrow R^d$, where $D$ is a database and $f$ performs a query on the database and returns a $d$-dimensional vector, the L1 sensitivity over sibling databases $D_1$ and $D_2$ is defined as follows:

  $S(f) = \max\limits_{D_1, D_2} \Vert f(D_1) - f(D_2) \Vert_1$

When the result returned by the function f is a number, that is, $f: D \rightarrow R$, then the L1 sensitivity is:

  $S(f) = \max\limits_{D_1, D_2} \vert f(D_1) - f(D_2) \vert$

For example, consider the counting query: how many records meet a certain condition. The result of this function is a single number, and its sensitivity is $S(f) \leq 1$: the record that differs between the two sibling databases either fails the condition, in which case the counts are identical (difference 0), or satisfies it, in which case the counts differ by exactly 1.
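To make this concrete, here is a minimal sketch (the table contents and the age condition are made-up assumptions) showing that dropping any single record changes the count by at most 1, which is exactly the sensitivity bound above.

```python
# A counting query: how many records have age >= 30 (hypothetical condition and data).
def count_query(records):
    return sum(1 for r in records if r["age"] >= 30)

db = [{"age": 25}, {"age": 42}, {"age": 31}, {"age": 19}]

# Sibling databases differ in exactly one record: drop each record in turn.
for i in range(len(db)):
    sibling = db[:i] + db[i + 1:]
    assert abs(count_query(db) - count_query(sibling)) <= 1   # the count changes by at most 1
```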

 

The relationship between $\lambda$, $\epsilon$, and $S(f)$ in Laplace noise

Let us first recall what these three parameters mean: $\lambda$ is the scale parameter of the Laplace distribution, which determines its variance;

$\epsilon$ is the privacy budget used to control the degree of privacy in the definition of differential privacy;

$S(f)$ is the sensitivity of the function, as defined above.

We know that the output of the function $f$ on database $D$ is $f(D)$, and after adding Laplace noise the output has probability density $\frac{1}{2\lambda} \exp\left(- \frac{\vert x - f(D)\vert}{\lambda}\right)$.

Then, at a point $a$, the probability of the output is proportional to the probability density at that point: $Pr[K_f(D) = a] \propto \exp\left(- \frac{\vert f(D) - a\vert}{\lambda}\right)$

Here $K_f(D)$ denotes the output obtained by applying the random mechanism $K$ to the value of the function $f$ on the database $D$.

 

Then for sibling databases $D_1$ and $D_2$, we have $\frac{Pr[K_f(D_1) = a]}{Pr[K_f(D_2) = a]} = \frac{\exp(- \vert f(D_1) - a\vert / \lambda)}{\exp(- \vert f(D_2) - a\vert / \lambda)} = \exp\left( \frac{\vert f(D_2) - a\vert - \vert f(D_1) - a\vert}{\lambda} \right)$

Then, by the reverse triangle inequality $\lvert a \rvert - \lvert b \rvert \leq \lvert a - b \rvert$, we get

$\exp\left( \frac{\vert f(D_2) - a\vert - \vert f(D_1) - a\vert}{\lambda} \right) \leq \exp\left( \frac{\lvert f(D_1) - f(D_2) \rvert}{\lambda} \right) \leq \exp\left( \frac{S(f)}{\lambda} \right)$

It can be seen that if Laplace noise with scale parameter $\lambda$ is added to the output of the function $f$, then $\frac{S(f)}{\lambda}$-differential privacy is satisfied.

Equivalently, if Laplace noise with scale parameter $\frac{S(f)}{\epsilon}$ is added to the output of the function $f$, then $\epsilon$-differential privacy is satisfied.
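A minimal sketch of this conclusion in Python (the query value, sensitivity, and privacy budget below are illustrative assumptions, not from the original post): to release $f(D)$ with $\epsilon$-differential privacy, draw Laplace noise with scale $S(f)/\epsilon$ and add it to the true answer.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value with epsilon-DP by adding Laplace noise of scale sensitivity / epsilon."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Example: a counting query with sensitivity 1, released with privacy budget epsilon = 0.5.
true_count = 42
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(noisy_count)
```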

 

Histogram Differential Privacy

  A histogram has the following property: all data are divided into disjoint, equal-width bins, and adding or removing one record in the database affects the count of only one bin (by at most 1), so the sensitivity of a histogram query is 1. Therefore, when a histogram is published, Laplace noise with scale $1/\epsilon$ can be added directly to each bin count to satisfy $\epsilon$-differential privacy, as in the sketch below.
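A hedged sketch of a noisy histogram release, assuming the bin counts and $\epsilon = 1$ below are made-up example values: since the histogram sensitivity is 1, independent Laplace noise with scale $1/\epsilon$ is added to every bin.

```python
import numpy as np

def dp_histogram(counts, epsilon, rng=None):
    """Add independent Laplace(1/epsilon) noise to each bin count (histogram sensitivity = 1)."""
    rng = rng or np.random.default_rng()
    counts = np.asarray(counts, dtype=float)
    return counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)

true_counts = [12, 7, 30, 5, 0]        # hypothetical bin counts
print(dp_histogram(true_counts, epsilon=1.0))
```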
