Utility-Privacy Tradeoffs in Databases: An Information-Theoretic Approach

Introduction

[Figure 1: an example database with public attributes (gender, weight) and private attributes (cancer diagnosis, income).]

As shown in Figure 1, in the course of a legitimate transaction a user learns some public information (for example, gender and weight); this is permitted and must in fact be supported for the transaction to be meaningful. The user may also learn or infer private information (for example, cancer diagnosis and income), and this is what must be prevented, or at least minimized. Every user of the data is therefore also a potential adversary. Privacy and information leakage have been studied by multiple research communities for decades. Information-theoretic approaches to the problem are few and far between, and they focus mainly on the use of information-theoretic measures. A rigorous information-theoretic treatment of the utility-privacy (UP) tradeoff remains open, and the following issues are unresolved: (i) the statistical assumptions on the data that permit an information-theoretic analysis, (ii) the ability to reveal different levels of private information to different users, and (iii) modeling and accounting for prior knowledge. In this work we apply information-theoretic tools to the open problem of providing a rigorous analysis and characterization of the UP tradeoff. If the public and private attributes of the data in a repository are treated as random variables with a joint probability distribution, then keeping the private attributes private means that revealing the public attributes should release no additional information about the private attributes. In other words, minimizing the privacy loss means that the conditional entropy of the private attributes should remain as high as possible after the disclosure. Thus, in Figure 1, keeping the cancer attribute private means that its predictability should remain unchanged even when the public gender and weight attributes are known; for this reason, the gender attribute in entry 1 has been "sanitized".

The usefulness of a data source lies in its ability to disclose data, so privacy considerations necessarily reduce utility. Utility and privacy are, in this setting, competing goals. To make a principled tradeoff we need to know the maximum utility achievable at a given privacy level, and vice versa; that is, we need to analyze and characterize the set of all achievable UP tradeoff points. We show that this can be done with a well-honed tool from information theory, namely rate-distortion theory: utility is quantified by fidelity, which is inversely related to distortion, and rate-distortion theory is augmented with a privacy constraint quantified by equivocation, a conditional-entropy measure.
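A minimal sketch of the kind of constraints involved, in notation chosen here rather than taken from the paper: let $X^n$ denote the public attributes of an $n$-row database, $Y^n$ the private attributes, and $\hat{X}^n$ the sanitized public attributes that are released. A utility level $D$ and a privacy level $E$ are (roughly) achievable when

$$\frac{1}{n}\,\mathbb{E}\!\left[\sum_{k=1}^{n} d\big(X_k,\hat{X}_k\big)\right] \le D + \epsilon \qquad \text{(utility: bounded average distortion)},$$

$$\frac{1}{n}\,H\big(Y^n \mid \hat{X}^n\big) \ge E - \epsilon \qquad \text{(privacy: minimum equivocation)},$$

for some distortion measure $d(\cdot,\cdot)$ and vanishing $\epsilon$; the UP tradeoff region is then the set of all such achievable $(D, E)$ pairs.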

Our contribution: The main contribution of this work is the use of rate-distortion theory with an additional privacy constraint to precisely quantify the tradeoff between the privacy needs of the individuals represented in the data and the utility of the sanitized (published) data, for any data source. Utility is quantified via distortion (inversely, accuracy) and privacy via equivocation (entropy). For the first time, we expose a fundamental dimension of information disclosure through an additional constraint on the rate of the disclosed data (a measure of the precision of the sanitized data). Any controlled disclosure of public data must specify both the accuracy and the precision of the disclosure; while additive noise can conflate the two for numerical data, additive noise is not an option for categorical data (social security numbers, zip codes, disease status, etc.), so specifying the output precision becomes important. For example, in Figure 1 the weight attribute is a numeric field that could be distorted by random additive noise or truncated (quantized) into ranges such as 90-100, 100-110, and so on. A familiar non-numeric example is the Social Security Number (SSN) used to identify students on transcripts while protecting their privacy: sanitization (of the full SSN) is achieved by heuristically reducing its precision to the customary last four digits. What is desirable is a theoretical framework that formally specifies the output precision necessary and sufficient to achieve the optimal UP tradeoff. In [1], a rate-distortion-equivocation (RDE) tradeoff was developed for a simple source model. We translate this formalism to the UP problem and develop a framework that allows us to model common data sources, including multi-dimensional databases and data streams [2], to develop abstract utility and privacy metrics, and to quantify the fundamental UP tradeoff. We then propose a sanitization scheme that achieves the UP tradeoff region and demonstrate its application in numerical and categorical examples. Note that the correlations available to a user/adversary can be internal (i.e., between variables within the database) or external (with variables outside the database but accessible to the user/adversary).
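As a small, hypothetical illustration of the two precision-reduction heuristics just mentioned (quantized weight ranges and the last-four-digits SSN convention), the sketch below shows them in Python; the function names and output formats are assumptions made here for illustration, not part of the paper's mechanism.

```python
# Illustrative only: two heuristic precision-reduction sanitizers mentioned in the text
# (quantizing a numeric field, truncating a categorical identifier). These helpers are
# hypothetical and are not the optimal mechanism derived in the paper.

def quantize_weight(weight: float, bin_width: int = 10) -> str:
    """Map an exact weight to a coarse range such as '90-100'."""
    lower = int(weight // bin_width) * bin_width
    return f"{lower}-{lower + bin_width}"

def truncate_ssn(ssn: str) -> str:
    """Keep only the customary last four digits of an SSN-like identifier."""
    digits = ssn.replace("-", "")
    return "***-**-" + digits[-4:]

print(quantize_weight(96.4))           # -> '90-100'
print(truncate_ssn("123-45-6789"))     # -> '***-**-6789'
```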

Our examples illustrate two basic aspects of our framework: (i) how the statistical model of the data and the UP metrics determine the appropriate distortion and suppression of the data needed to achieve the privacy and utility guarantees; and (ii) how the source statistics determine the optimal sanitization mechanism and, hence, the largest UP tradeoff region. The rest of this article is organized as follows. In Section 2 we briefly survey the state of the art in database privacy research. In Section 3 we motivate the need for an information-theoretic analysis and present the intuition behind our framework. In Section 4 we give an abstract model and metrics for structured data sources such as databases. We develop the main analytical framework in Section 5.

Related work

Techniques such as k-anonymity have proven not to be universal, because they are effective only against a limited class of adversaries. In the differential privacy (DP) line of work, the privacy of an individual in a database is defined as a bound on any adversary's ability to accurately detect whether that individual's data belongs to the database.

The DP guarantee is stricter than our definition of privacy based on Shannon entropy. However, our model appears to be more intuitively accessible and is suited to many application domains that do not require strict anonymity of membership. For example, in many health databases the presence of an individual's record is not a secret, but the individual's disease status is.

Our sanitization approach applies to both numerical and categorical data, whereas DP, although a very popular privacy model, appears to be limited to numerical data. Moreover, the utility loss incurred by DP-based sanitization can be significant [13]. Some work has addressed the utility loss of application-specific privacy mechanisms [14].

More generally, a rigorous utility-privacy tradeoff model, together with a mechanism achieving all of its optimal points, has remained open; that is the subject of this article. The use of information-theoretic tools for privacy and related problems is relatively rare. In [1], rate-distortion theory with an additional equivocation constraint was used to analyze a simple two-variable model, and this is the main motivation for the present work. In addition, some recent work relates differential privacy guarantees to Rényi entropy [15] and Shannon entropy [16].

Motivation and background knowledge

An information-theoretic approach to database privacy involves two steps: the first is a data modeling step, and the second is the derivation of a mathematical formalism for sanitization. Before introducing the formal models and abstractions, we first present an intuitive understanding of, and motivation for, the approach.

Motivation: Statistical Model

Our work is based on the observation that large data sets (including databases) have a distributional basis; that is, there is an underlying (and sometimes implicit) statistical model for the data. Even in data-mining settings where only one or a few instances of a data set are available, exploiting correlations between attributes relies on implicit distributional assumptions about the data set. We explicitly model the data as generated by a source with a finite or infinite alphabet and a known distribution. Each row of the database is a collection of correlated attributes (of an individual) that belongs to the source alphabet and is generated according to the probability of occurrence of that letter.

Our statistical model of the database is also motivated by the fact that, although the attributes of a single individual may be correlated (for example, the weight and cancer attributes in Figure 1), the records of a large number of individuals are generally independent of, or only weakly dependent on, one another. We therefore model the database as a collection of observations generated by a memoryless source, whose outputs are independent and identically distributed.
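As a concrete, hypothetical illustration of this memoryless model, the sketch below draws database rows i.i.d. from a small joint distribution over one public attribute (a weight range) and one private attribute (cancer status); the alphabet and probabilities are invented for illustration.

```python
import numpy as np

# Hypothetical joint distribution p(public, private) over a small alphabet.
# Rows: weight range (public attribute); columns: cancer status (private attribute).
alphabet_public = ["90-100", "100-110", "110-120"]
alphabet_private = ["no", "yes"]
p_joint = np.array([
    [0.35, 0.05],
    [0.30, 0.10],
    [0.10, 0.10],
])  # entries sum to 1

rng = np.random.default_rng(0)
n_rows = 5

# Each database row is drawn i.i.d. from the joint distribution (memoryless source).
flat_index = rng.choice(p_joint.size, size=n_rows, p=p_joint.ravel())
rows = [(alphabet_public[i // 2], alphabet_private[i % 2]) for i in flat_index]
print(rows)
```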

We quantify privacy using conditional entropy. Intuitively, privacy means preserving the uncertainty about information that has not been explicitly disclosed.
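A minimal sketch of this measure, reusing the illustrative joint distribution above: the equivocation H(private | public) is the average uncertainty that remains about the private attribute once the public attribute is known, and higher values correspond to more privacy.

```python
import numpy as np

def conditional_entropy_bits(p_joint: np.ndarray) -> float:
    """H(private | public) in bits, with public attributes on the rows
    and private attributes on the columns of the joint distribution."""
    p_public = p_joint.sum(axis=1, keepdims=True)        # marginal p(public)
    p_cond = np.zeros_like(p_joint)
    np.divide(p_joint, p_public, out=p_cond, where=p_public > 0)  # p(private | public)
    log_term = np.zeros_like(p_cond)
    np.log2(p_cond, out=log_term, where=p_cond > 0)
    return float(-(p_joint * log_term).sum())

p_joint = np.array([[0.35, 0.05],
                    [0.30, 0.10],
                    [0.10, 0.10]])
print(conditional_entropy_bits(p_joint))  # ~0.74 bits; larger values mean more privacy
```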

Low-probability (high-information) samples, i.e., outliers, are suppressed or heavily distorted, while high-probability (frequently occurring) samples are only mildly distorted. As we show formally in the sequel, our approach and solution for categorical databases capture a key aspect of the privacy challenge: suppressing the high-information (low-probability) outliers and distorting everything else to the level required by the desired utility/distortion target.

The sanitization process we propose determines the output (database) statistics that achieve the desired levels of utility and privacy, and thereby determines which input values are to be perturbed and how to perturb them probabilistically. Since the output statistics depend on the sanitization process, for the source model considered here the problem mathematically reduces to finding the symbol transition probabilities from input to output.
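A minimal sketch of what such a probabilistic symbol-to-symbol mapping looks like when applied to one categorical column; the transition matrix below is invented for illustration (it is not the output of the optimization), but it follows the intuition above: frequent symbols are lightly distorted and the rare symbol is mostly suppressed.

```python
import numpy as np

# Hypothetical sanitization channel p(output | input) for one categorical attribute.
# Rows index input symbols, columns index output symbols; "*" denotes suppression.
in_symbols = ["90-100", "100-110", "110-120"]          # "110-120" plays the role of a rare outlier
out_symbols = ["90-100", "100-110", "110-120", "*"]
transition = np.array([
    [0.9, 0.1, 0.0, 0.0],   # frequent symbol: lightly distorted
    [0.1, 0.9, 0.0, 0.0],   # frequent symbol: lightly distorted
    [0.0, 0.2, 0.1, 0.7],   # rare symbol: mostly suppressed
])  # each row sums to 1

def sanitize(column, rng):
    """Apply the channel independently to each entry of the input column."""
    sanitized = []
    for value in column:
        row = transition[in_symbols.index(value)]
        sanitized.append(out_symbols[rng.choice(len(out_symbols), p=row)])
    return sanitized

rng = np.random.default_rng(1)
print(sanitize(["90-100", "110-120", "100-110"], rng))
```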

Background knowledge: rate distortion theory

For the purpose of privacy modeling, the attributes of any individual in the database are divided into two categories: public attributes that may be disclosed and private attributes that must be hidden. An attribute can be both public and private at the same time. An individual's attributes are correlated; this means that if the public attributes are released as-is, a user can apply the correlation model to infer information about the private attributes. Ensuring the privacy of the private (hidden) attributes therefore requires modifying/sanitizing/distorting the public attributes. At the same time, the public attributes carry utility constraints that limit the allowable distortion.

Our approach is to determine the optimal sanitization, that is, the mapping, among the set of all possible mappings that transform the public attributes of the database, that guarantees the maximum privacy for the private attributes at a desired utility level for the public attributes. We use the terms encoding and decoding to refer to this mapping at the data publisher and at the user, respectively. A database instance is a realization of a random source (a vector source when the number of attributes is large) and can be viewed as a point in a high-dimensional space (see Figure 2). The set of all possible databases (finite-length source sequences) that can be generated from the source statistics (probability distribution) lies in this space. The utility metric we choose measures, through a distortion requirement, the average "closeness" between the public attributes of the original database and those of the published database. The output of sanitization is therefore another database (another point in the same space) within the allowed distortion radius. We seek a set of output databases, whose size is governed by a rate parameter discussed below, that "covers" the space, i.e., such that for any given input database instance there is an output database within the allowed distortion of it.
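For reference, the covering argument above invokes the standard, textbook rate-distortion function; in notation chosen here (not taken from the paper), with $X$ a public attribute and $\hat{X}$ its sanitized version,

$$R(D) \;=\; \min_{\,p(\hat{x}\mid x)\;:\;\mathbb{E}[d(X,\hat{X})]\le D} I(X;\hat{X}),$$

and roughly $2^{nR(D)}$ output databases suffice to cover the space of typical length-$n$ input databases within average distortion $D$. The privacy constraint then restricts attention, among the channels $p(\hat{x}\mid x)$ meeting the distortion budget, to those that also keep the equivocation of the private attributes high.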


Origin blog.csdn.net/weixin_42253964/article/details/107719072