Privacy protection: private information retrieval

[Introduction] User privacy protection involves many aspects, and protecting the privacy of user behavior is especially difficult. Over the weekend I read a paper, https://cacm.acm.org/magazines/2010/4/81501-private-information-retrieval/fulltext, which involves many mathematical methods and concepts and is quite laborious to read. Is private information retrieval overkill? What would motivate a company to implement it?

The popularity of the Internet means that a large amount of online data and resources is available and has become indispensable for retrieving information. To some extent, this also poses a significant risk to user privacy. In fact, users are often wary of accessing public data when they want to keep their interest in it private. For example, a company may wish to search for certain patents without revealing which patents it is interested in.

So how can users' privacy be protected when they perform information retrieval? This is where a technology called private information retrieval comes in.

What is private information retrieval?

Private information retrieval (PIR) is a cryptographic protocol designed to protect the privacy of database users. It allows a client to retrieve a record from a public database while hiding the identity of the retrieved record from the database owner. At first glance, retrieving data without revealing which item was retrieved seems almost impossible. Of course, there is a trivial solution: when users need a single piece of data, they can ask for a copy of the entire database. However, this solution involves huge communication overhead and may not be acceptable. Unfortunately, when the database is held by a single server, this trivial solution is essentially optimal for users who want to protect their privacy completely (in the information-theoretic sense).

In 1995, Chor, Goldreich, Kushilevitz, and Sudan proposed a private information retrieval scheme in which the database is replicated across several servers that do not communicate with each other. In this protocol, the user queries each server holding a copy of the database in such a way that no individual server obtains any information about the identity of the item the user is interested in.

Private information retrieval schemes are closely related to a special class of error-correcting codes called "locally decodable codes", which are objects of interest in their own right. Error-correcting codes help ensure reliable transmission of information over noisy channels and reliable storage of information on media where access devices are prone to errors. Such a code adds redundancy to a message, encoding the bit string into a longer bit string (the codeword) so that the message can still be recovered even if a certain fraction of the codeword bits are corrupted. In a typical application of error-correcting codes, the message is first divided into small blocks, and then each small block is encoded separately. This encoding strategy allows efficient random-access retrieval of information because only the portion of the data of interest needs to be decoded. Unfortunately, this strategy yields poor noise resilience because, if even a single block is completely corrupted, some information is lost.

Given this limitation, it seems preferable to encode the entire message into a single codeword of an error-correcting code. This solution improves robustness to noise, but it is unsatisfactory because the entire codeword must be read in order to recover any specific bit of the message. Such decoding complexity is prohibitive for today's large-scale data sets.

Locally decodable codes provide both efficient random-access retrieval and high noise resilience, allowing reliable reconstruction of an arbitrary bit of the message by looking at only a small number of randomly selected codeword bits.

A first introduction to private information retrieval

Model the database as an n-bit string X that is replicated among a small number of servers S1,...,Sk that do not communicate with each other. The user holds an index i (an integer between 1 and n) and wants to obtain the value of the bit Xi. To achieve this, the user sends a randomized query to each server and computes Xi from the responses. The query sent to each server is distributed independently of i, so no individual server learns anything about what the user wants.

The user's queries are not necessarily requests for particular individual bits of the data. Rather, they specify functions for the servers to compute; for example, a query might specify a set of indices between 1 and n, and the server's response might be the XOR of the data bits stored at those indices.
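To make the XOR-style query concrete, here is a minimal sketch of a classic two-server scheme built on it. It is an illustration under stated assumptions rather than the construction the paper focuses on: the function names are hypothetical, indices are 0-based, and each query is as long as the database, so this shows privacy but not low communication.

```python
import secrets

def make_queries(n, i):
    """Return two index sets whose symmetric difference is exactly {i}."""
    # s1 is a uniformly random subset of {0, ..., n-1}.
    s1 = {j for j in range(n) if secrets.randbits(1)}
    # s2 equals s1 with the membership of i toggled, so s2 is also uniform.
    s2 = s1 ^ {i}
    return s1, s2

def server_answer(database, query):
    """A server returns the XOR of the database bits at the queried indices."""
    answer = 0
    for j in query:
        answer ^= database[j]
    return answer

def retrieve(database, i):
    """Simulate the user talking to two non-colluding servers."""
    s1, s2 = make_queries(len(database), i)
    a1 = server_answer(database, s1)  # computed by server 1
    a2 = server_answer(database, s2)  # computed by server 2
    # Every index except i appears in both queries or in neither, so it cancels.
    return a1 ^ a2
```

Each server on its own sees a uniformly random subset of indices, which carries no information about i; only the combination of both queries reveals it.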

The main parameter of a private information retrieval scheme is its communication complexity, a function that measures the total number of bits communicated between the user and the servers. At the time the paper was written, the most efficient known two-server protocol had communication complexity O(n^(1/3)), while schemes involving three or more servers had seen further improvements.

The Hadamard code allows extremely fast recovery of message bits at the expense of a very large codeword length. For example, given a codeword in which 10% of the bits are corrupted, reading just two bits of the codeword recovers any bit of the message with probability at least 80%. More generally, in a k-query locally decodable code each bit Xi of the message can be recovered from many different k-tuples of codeword bits. Consequently, the distribution of each individual query of the decoder must be close to uniform over the codeword bits.

A k-query locally decodable code with codewords of length N yields a k-server scheme: each server stores the codeword, the user runs the decoder and sends its j-th query Qj to server j, and each server replies with the requested codeword bit. Verifying that this protocol is private is straightforward, since for each j in [k] the query Qj is uniformly distributed over the set of codeword coordinates. The total communication is k(log N + 1) bits.
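The two-bit recovery claim for the Hadamard code can be checked with a small sketch. The helper names below are hypothetical and the bit ordering is an assumption of this sketch: position a of the codeword stores the inner product of the message with the binary expansion of a, modulo 2.

```python
import random

def hadamard_encode(x):
    """Encode n message bits into 2**n bits; position a holds <x, a> mod 2."""
    mask = sum(bit << j for j, bit in enumerate(x))
    return [bin(a & mask).count("1") % 2 for a in range(2 ** len(x))]

def decode_bit(y, n, i, rng=random):
    """Recover message bit i by reading two positions of a possibly corrupted word y."""
    a = rng.randrange(2 ** n)      # uniformly random position
    b = a ^ (1 << i)               # same position with bit i flipped, also uniform
    return y[a] ^ y[b]             # <x, a> + <x, a + e_i> = x_i (mod 2)

def success_fraction(y, x, i):
    """Exact fraction of the decoder's random choices that return the right bit."""
    n = len(x)
    good = sum(1 for a in range(2 ** n) if (y[a] ^ y[a ^ (1 << i)]) == x[i])
    return good / 2 ** n
```

If at most 10% of the positions are corrupted, each of the two reads hits a corrupted position with probability at most 10%, so at most 20% of the random choices can fail; `success_fraction` lets you confirm this for a concrete corrupted word.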

Early private information retrieval

This section describes a simple (d + 1)-server scheme that accesses an n-bit database using O(n^(1/d)) communication. The key idea behind this scheme is polynomial interpolation over a finite field.

Assume that p > d is a prime number. Addition and multiplication of the numbers {0,...,p-1} modulo p satisfy the standard identities that hold for real numbers. That is, the numbers {0,...,p-1} form a finite field with respect to these operations. This field is denoted F_p. The polynomials considered below are defined over finite fields, and they have all the algebraic properties of real polynomials. Specifically, the values of a univariate polynomial of degree d over F_p at any d + 1 points uniquely determine the polynomial.

Let m be a large integer. Let E1,...,En be a set of n m-dimensional vectors over F_p. This set is fixed and independent of the n-bit database x, and both the servers and the user know it. In the preprocessing stage of the private information retrieval protocol, each of the (d + 1) servers represents the database x by the same degree-d polynomial f in m variables. The key property of this polynomial is that for every i in [n]: f(Ei) = xi. To ensure that such a polynomial f exists, m must be chosen reasonably large relative to n. Generally, setting m = O(n^(1/d)) is sufficient.

Suppose the user wants to retrieve the i-th bit of the database and knows the set of vectors E1,...,En. The user's goal is therefore to recover the value of the polynomial f (held by the servers) at the point Ei. Obviously, the user cannot explicitly request the value of f(Ei) from any server, since such a request would violate the privacy of the protocol; that is, some server would learn which data bit the user wants. Instead, the user obtains the value of f(Ei) indirectly. In particular, the user generates a randomized set of m-dimensional vectors P1,...,Pd+1 over F_p such that:

Each individual vector Pj is uniformly random and therefore provides no information about Ei;

The values of any degree-d polynomial (including the polynomial f) at P1,...,Pd+1 determine the value of that polynomial at Ei.

The user sends one of the vectors P1,...,Pd+1 to each server. Each server then evaluates the polynomial f at the vector it received and returns the resulting value to the user. The user combines the values f(P1),...,f(Pd+1) to get the desired value f(Ei). The protocol is completely private, and the communication amounts to sending (d + 1) vectors of dimension m to the servers and returning one value from each server to the user.
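The protocol above can be sketched in a few lines. The source does not fix the points Ei, the polynomial f, or the way the Pj are generated, so the choices below are assumptions of this sketch: Ei are 0/1 vectors with exactly d ones, f is the sum over i with xi = 1 of the product of the coordinates in the support of Ei, and Ph = Ei + h*V for a random vector V and h = 1,...,d+1. This particular choice needs p > d + 1 so that d + 1 distinct nonzero values of h exist. All function names are hypothetical.

```python
import itertools
import random

def supports(n, m, d):
    """Supports of E_1..E_n: the first n size-d subsets of {0..m-1}."""
    subsets = list(itertools.islice(itertools.combinations(range(m), d), n))
    if len(subsets) < n:
        raise ValueError("m is too small: need C(m, d) >= n")
    return subsets

def evaluate_f(db, supp, z, p):
    """Server side: f(z) = sum over i with db[i] = 1 of prod_{l in supp[i]} z[l], mod p."""
    total = 0
    for bit, subset in zip(db, supp):
        if bit:
            term = 1
            for l in subset:
                term = term * z[l] % p
            total = (total + term) % p
    return total

def make_queries(i, supp, m, d, p, rng):
    """User side: P_h = E_i + h*V for h = 1..d+1, with V uniformly random."""
    e = [1 if l in supp[i] else 0 for l in range(m)]
    v = [rng.randrange(p) for _ in range(m)]
    return [[(e[l] + h * v[l]) % p for l in range(m)] for h in range(1, d + 2)]

def interpolate_at_zero(answers, p):
    """Lagrange interpolation: recover g(0) from g(1), ..., g(d+1) over F_p."""
    hs = range(1, len(answers) + 1)
    result = 0
    for h, a in zip(hs, answers):
        num, den = 1, 1
        for t in hs:
            if t != h:
                num = num * (-t) % p
                den = den * (h - t) % p
        # den is nonzero mod p, so Fermat's little theorem gives its inverse.
        result = (result + a * num * pow(den, p - 2, p)) % p
    return result

def retrieve(db, i, m, d, p, rng=random):
    """Simulate the user and the d+1 servers; returns db[i]."""
    supp = supports(len(db), m, d)
    queries = make_queries(i, supp, m, d, p, rng)
    answers = [evaluate_f(db, supp, q, p) for q in queries]  # one per server
    return interpolate_at_zero(answers, p)
```

The reconstruction works because g(h) = f(Ei + h*V) is a univariate polynomial of degree at most d in h, and g(0) = f(Ei). For example, `retrieve(db, 3, m=5, d=2, p=7)` handles a 10-bit database with three servers, since there are 10 two-element subsets of a 5-element set.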

Modern private information retrieval

Modern private information retrieval schemes are no longer based on polynomials. They are built from locally decodable codes, and the key technical ingredient is the design of a large family of sets with restricted intersections. Let k be a small integer; the goal is a k-query locally decodable code C that encodes n-bit messages into codewords. The construction consists of two steps: the first reduces the problem of constructing such a code to the problem of constructing a family of sets with restricted intersections; the second is an algebraic construction of the desired family of sets.

Step 1:

Only codes with the following two properties are considered.

C is an F_2-linear map: for any two messages x1, x2 in F_2^n, C(x1 + x2) = C(x1) + C(x2), where the sum of vectors is computed modulo 2 in each coordinate.

The decoding algorithm works by reading some k-tuple of coordinates of the (possibly corrupted) codeword and outputting the exclusive OR (XOR) of the values in these coordinates.

For i in [n], let Ei denote the binary n-dimensional vector whose only nonzero coordinate is i. Every such linear map admits a combinatorial description. That is, for each i in [n] specify:

A set Ti of the coordinates of C(Ei) that are set to 1. These sets fully specify the encoding, since by linearity C(x) for any message x is the sum of C(Ei) over all i with xi = 1.

A family Qi of k-element sets of codeword coordinates that the decoding algorithm may read when reconstructing the i-th message bit.

Certain combinatorial constraints must be satisfied, and the basic rationale for these constraints is as follows:

Decoding must be correct when no codeword bits are corrupted. This means that for every i, j in [n] and any k-set S in Qi, the size of the intersection of S and Tj must be odd if i = j and even otherwise.

The distribution of individual queries of the decoding algorithm must be close to uniform. This means that for each i in [n], the union of the k-sets in Qi must be large relative to the number of codeword coordinates.
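As a sanity check of these two constraints, the sketch below verifies them for the Hadamard code mentioned earlier, viewed as a 2-query code. This is only an illustration on a simple known code, not the modern construction itself. It assumes codeword coordinates are indexed by integers a in {0,...,2^n - 1}, that Tj is the set of coordinates whose j-th binary digit is 1, and that Qi consists of the pairs {a, a XOR e_i}; the function names are hypothetical.

```python
def t_set(n, j):
    """T_j: coordinates where C(E_j) equals 1, i.e. bit j of the index is set."""
    return {a for a in range(2 ** n) if (a >> j) & 1}

def q_family(n, i):
    """Q_i: the 2-element sets {a, a XOR e_i} the decoder may read for bit i."""
    return {frozenset((a, a ^ (1 << i))) for a in range(2 ** n)}

def check_constraints(n):
    """Return True if both combinatorial constraints hold for every i and j."""
    all_coordinates = set(range(2 ** n))
    for i in range(n):
        family = q_family(n, i)
        for j in range(n):
            t = t_set(n, j)
            for s in family:
                is_odd = len(s & t) % 2 == 1
                # Correctness: the intersection is odd exactly when i == j.
                if is_odd != (i == j):
                    return False
        # Uniformity: here the union of Q_i covers every coordinate.
        if set().union(*family) != all_coordinates:
            return False
    return True
```

For this code the check succeeds because a and a XOR e_i differ only in digit i: exactly one of them lies in Ti, while for j different from i either both or neither lie in Tj.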

Step 2:

Design sets Ti and families Qi that satisfy these constraints. The construction is guided by geometric intuition. Consider a bijection between the set of codeword coordinates and the set of m-dimensional vectors over a finite field with k elements. In the m-dimensional linear space over F_k, choose each set Ti to be a union of certain parallel hyperplanes, and use basic algebra to reason about the sizes of the intersections.

Computational private information retrieval schemes, which rely on cryptographic hardness assumptions instead of replication, are attractive because they avoid the need to maintain replicated copies of the database and do not compromise user privacy when servers collude.

Conclusion

In recent years, private information retrieval has grown into a large and deep field with connections to other areas. Work on private information retrieval focuses on two main aspects: the communication complexity, and the amount of computation that the server must perform in order to respond to user queries.

Origin blog.csdn.net/wireless_com/article/details/132013917