Overview of Classic PSI Algorithm Papers

What is PSI

Private set intersection (PSI) is a cryptographic technique in secure multi-party computation that allows data holders to compute the intersection of their sets by comparing encrypted versions of those sets, without either party learning anything beyond the intersection. There is also a client-server (CS) variant of PSI, in which the client learns its intersection with the server's set but the server learns nothing about the client's set. If data sets are compared via cryptographic hashes over a small, predictable domain, precautions should be taken to prevent dictionary attacks.
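To make the dictionary-attack warning concrete, here is a minimal sketch of a naive hash-and-compare exchange and of the attack against it. The 4-digit PIN domain, the function names, and the sample sets are all hypothetical and chosen only for illustration.

```python
import hashlib

def h(value: str) -> str:
    """Unsalted SHA-256 digest, as a naive protocol would exchange it."""
    return hashlib.sha256(value.encode()).hexdigest()

def naive_hash_intersection(own_set, received_digests):
    """Return own elements whose digest appears among the peer's digests."""
    return {x for x in own_set if h(x) in received_digests}

def dictionary_attack(received_digests, candidate_domain):
    """Recover the peer's elements by hashing every value of a small domain."""
    return {x for x in candidate_domain if h(x) in received_digests}

# Hypothetical data: 4-digit PINs, a domain of only 10,000 values.
alice = {"0042", "1234", "9999"}
bob = {"1234", "5555"}
alice_digests = {h(x) for x in alice}  # what Alice sends to Bob

intersection = naive_hash_intersection(bob, alice_digests)
recovered = dictionary_attack(alice_digests, (f"{i:04d}" for i in range(10_000)))

print(intersection)        # {'1234'}
print(recovered == alice)  # True: Bob learns all of Alice's set
```

Bob obtains the correct intersection, but by enumerating the whole domain he also recovers every element Alice holds, which is exactly what a PSI protocol is supposed to prevent.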

PSI also has many everyday applications. Apple uses the technique in Password Monitoring, and it has proposed using the technology for its announced Expanded Protections for Children.

PSI protocol classification

Depending on how the data is hosted, PSI protocols can be divided into two broad categories: traditional PSI and delegated PSI.

  1. In traditional PSI, the data owners interact with each other directly and each needs to hold a copy of its own set at computation time. See the paper "Efficient Private Matching and Set Intersection".
  2. In delegated PSI, the PSI computation and/or the storage of the sets can be delegated to a third-party server (which may itself be a passive or active adversary). Delegated PSI protocols can be further divided into two categories: (a) those that support one-time delegation, and (b) those that support repeated delegation. Protocols that support one-time delegation require the data owners to re-encode their data and send the encoded data to the server for every computation. Protocols that support repeated delegation allow the data owners to upload their (encrypted) data to the server only once and then reuse it in many computations carried out by the server.

Recently, researchers have proposed PSI protocol variants that support data updates (in both the traditional and the delegated category). This type of PSI protocol allows data owners to insert elements into, or delete elements from, their sets with low overhead and in a privacy-preserving manner. See the papers "Updatable Private Set Intersection" and "Multi-party Updatable Delegated Private Set Intersection".

Classification of PSI Algorithms

PSI algorithms can be roughly divided into the following five categories.

  1. PSI algorithm based on hash functions: This type of algorithm hashes the elements of the data set and sends the hash values to the other participants for comparison. A common building block is the Bloom filter, which uses multiple hash functions to map elements into a bit vector. This approach is not a secure intersection protocol: when the input domain of the parties is small, it is exposed to dictionary attacks.

  2. PSI algorithm based on oblivious transfer (OT): This type of algorithm uses oblivious transfer, which lets one party obtain a specific element held by another party without learning anything about the other elements, while the other party does not learn which element was chosen. The OT-based PSI algorithm was first proposed in "Faster Private Set Intersection Based on OT Extension" in 2014. Among the secure PSI algorithms, OT-based PSI is the fastest, but its communication cost is not the lowest.

  3. GC-based PSI algorithm: This type of algorithm uses the garbled circuit (GC) protocol as its core method and combines it with OT to realize PSI. In such protocols each element of the intersection can carry a payload, which can be used to compute functions of the intersection. However, the performance of GC-based schemes still lags far behind the other types. As the data grows, the circuit depth keeps increasing, so circuit size, communication, computation, and memory overhead expand rapidly. Although not as fast and efficient as OT-based PSI, GC-based PSI is more flexible and can easily be adapted to compute variants of the set intersection function.

  4. PSI algorithm based on public key encryption: In public-key-based PSI, the participants typically use asymmetric techniques such as RSA, elliptic curves, or homomorphic encryption to encrypt and decrypt the original data of their private sets and find the common elements. Public-key-based PSI is a relatively secure class of PSI algorithms and has the lowest communication cost.

  5. PSI algorithm based on differential privacy: To prevent the PSI result itself from leaking private information, intersection methods that satisfy differential privacy have been proposed. Calibrated noise is added to the PSI result so that an adversary cannot infer the data sets from the intersection result. Differentially private intersection is widely used in data analysis scenarios such as social network analysis, medical research, and user behavior analysis. A PSI scheme based on differential privacy protects the results produced by the PSI algorithms above and therefore gives stricter privacy protection. Privacy protection and data accuracy can be balanced by adjusting the noise parameters as needed.

PSI Algorithm Based on Hash Function

PSI (Private Set Intersection) algorithms based on hash functions mainly build on the following common techniques:

  1. Bloom Filter: The Bloom filter is the classic building block for hash-based PSI. It hashes each element of a data set with multiple hash functions and maps the hash values into a bit vector. By comparing bit vectors, the elements in the intersection of two data sets can be determined. See "Private set intersection with Bloom filters" for details.

  2. Count-Min Sketch: The Count-Min sketch is a probabilistic data structure commonly used in hash-based PSI algorithms. It uses multiple hash functions to map elements into a two-dimensional table of counters, and the accumulated counts give an estimate of how often each element occurs. In PSI, a Count-Min sketch can be used to judge whether two data sets intersect. See "An improved data stream summary: the CountMin sketch and its applications".

  3. Cuckoo Filter: The cuckoo filter is a data structure for approximate set membership testing that can also be applied to hash-based PSI. It uses two hash functions to map elements into a hash table and checks the corresponding table entries to determine whether an element is present. See "Efficient Circuit-based PSI via Cuckoo Hashing".

  4. MinHash: MinHash is an algorithm for approximating the similarity of sets, and it can also be used for PSI. It hashes the elements, takes the minimum hash value under each hash function, combines these minima into a signature, and estimates the overlap between sets by comparing signatures. See "EsPRESSo: Efficient Privacy-Preserving Evaluation of Sample Set Similarity".
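The MinHash idea from the last item can be sketched in a few lines. This is a plain, non-private illustration of the signature comparison only; the seeded SHA-256 hash family, the signature length of 256, and the sample sets are assumptions made for the example.

```python
import hashlib

def _hash(seed: int, item: str) -> int:
    """One member of a hash family, selected by `seed` (illustrative choice)."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def minhash_signature(items, num_hashes=256):
    """Signature = minimum hash value of the set under each hash function."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates |A∩B| / |A∪B|."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

set_a = {f"user{i}" for i in range(0, 100)}
set_b = {f"user{i}" for i in range(50, 150)}

sig_a = minhash_signature(set_a)
sig_b = minhash_signature(set_b)
estimate = estimate_jaccard(sig_a, sig_b)
exact = len(set_a & set_b) / len(set_a | set_b)  # 50 / 150
print(f"estimated {estimate:.3f}, exact {exact:.3f}")
```

The signatures give a similarity estimate rather than the intersection itself, which is why MinHash suits similarity-style questions more than exact PSI.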

Hash-based PSI algorithms need random access to the filters at any time, so all filters must be kept in memory, which consumes a certain amount of memory. PSI based on cuckoo hashing uses less space than PSI based on Bloom filters.
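As an illustration of the Bloom filter approach and of the memory point above, the following sketch builds a filter over one party's set and lets the other party test its own elements against it. The class, the parameters (16,384 bits, 7 hash functions), and the item names are assumptions for the example; the result is only a candidate intersection, because Bloom filters can report false positives, and on its own this exchange is not private.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: `num_hashes` hash functions over a bit vector."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # whole filter lives in memory

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))

# Hypothetical sets: the server holds 1,000 items, the client holds 4.
server_items = {f"id{i}" for i in range(1000)}
client_items = ["id5", "id999", "id1000", "zzz"]

bf = BloomFilter(num_bits=16_384, num_hashes=7)
for item in server_items:
    bf.add(item)

candidates = [x for x in client_items if x in bf]
print(candidates, f"(filter size: {len(bf.bits)} bytes)")
```

A Bloom filter never produces false negatives, so every true common element appears among the candidates; the filter size fixes the memory cost regardless of how long the individual elements are.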

PSI Algorithm Based on Oblivious Transfer (OT)

The OT-based PSI algorithm works as follows:
Suppose there are two sets $A$ and $B$, where $A$ contains the elements $a_1, a_2, ..., a_n$ and $B$ contains the elements $b_1, b_2, ..., b_m$. Computing PSI with OT is equivalent to solving $m$ instances of the $\binom{N}{1}$-OT problem, where the element $b_i$ of $B$ is the choice in the $i$-th instance.
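For readers unfamiliar with the primitive, here is a minimal sketch of a single 1-out-of-N OT instance, written in the style of Diffie-Hellman based "simplest OT" constructions. It is not the OT extension used by the cited papers. The modulus `P`, generator `G`, and helper names are illustrative assumptions, both roles run in one function for brevity, and the toy group offers no real security.

```python
import hashlib
import secrets

P = 2**127 - 1  # toy prime modulus (assumption: for illustration only, not secure)
G = 3           # toy generator choice (assumption)

def kdf(point: int, length: int) -> bytes:
    """Derive a one-time pad of `length` bytes from a group element."""
    return hashlib.shake_256(point.to_bytes(16, "big")).digest(length)

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def ot_1_of_n(messages, choice):
    """Run sender and receiver of one 1-out-of-N OT in a single process."""
    # Sender: publish A = g^a.
    a = secrets.randbelow(P - 2) + 1
    A = pow(G, a, P)
    # Receiver: hide the choice index in B = A^choice * g^b.
    b = secrets.randbelow(P - 2) + 1
    B = (pow(A, choice, P) * pow(G, b, P)) % P
    # Sender: derive one key per index j and encrypt message j under it.
    ciphertexts = []
    for j, msg in enumerate(messages):
        shared = pow(B * pow(A, -j, P) % P, a, P)  # equals g^(ab) only if j == choice
        ciphertexts.append(xor(msg, kdf(shared, len(msg))))
    # Receiver: A^b = g^(ab) opens exactly the chosen ciphertext.
    key = kdf(pow(A, b, P), len(ciphertexts[choice]))
    return xor(ciphertexts[choice], key)

msgs = [b"alpha", b"bravo", b"charl", b"delta"]
received = ot_1_of_n(msgs, 2)
print(received)  # b'charl'
```

The sender never sees `choice`, only the blinded value `B`, and the receiver can derive just one of the N keys. An OT-based PSI protocol runs many such instances, with the receiver's element determining the choice.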

OT-based PSI algorithms can be divided into the following categories:

  1. OT PSI algorithm based on hashing: This algorithm uses hash functions to generate the parameters required by OT and uses them to execute the OT protocol, which reduces the communication to $O(n \log n)$. See "Phasing: Private Set Intersection using Permutation-based Hashing".

  2. OT PSI algorithm based on a pseudorandom function (PRF): This algorithm uses a PRF to generate the parameters required by OT and then uses these parameters to execute the OT protocol and realize PSI. It is known as the KKRT PSI algorithm; see "Efficient batched oblivious PRF with applications to private set intersection" (2016). The algorithm replaces components of "Phasing: Private Set Intersection using Permutation-based Hashing" with BaRK-OPRF (derived from IKNP OT), which improves efficiency for long items and large data sets and makes it 2.3 to 3.6 times faster than that protocol.

  3. OT PSI algorithm based on a commitment scheme: This algorithm uses a commitment scheme to achieve random selection and then uses these random values to execute the OT protocol and realize PSI. See "Malicious-Secure Private Set Intersection via Dual Execution" (described as the fastest PSI algorithm under the malicious adversary model) and "Actively Secure 1-out-of-N OT Extension with Application to Private Set Intersection".

  4. PSI algorithm based on Bloom filters: The Bloom filter is a data structure that efficiently determines whether an element belongs to a set. This type of algorithm stores the hash values of the set elements in a Bloom filter, which reduces communication and computation costs. The idea was first proposed in "Fast private set operations with sepia". In "Outsourced private set intersection using homomorphic encryption", the Bloom filter is combined with homomorphic encryption. To address the security and efficiency problems of these two papers, "When Private Set Intersection Meets Big Data: An Efficient and Scalable Protocol" proposes an oblivious Bloom filter algorithm. In this algorithm the client encodes its private set in a Bloom filter (BF), the server encodes its private set in a garbled Bloom filter (GBF), and the intersection is then computed through the OT protocol. The algorithm comes in a basic and an enhanced version, which extends it from the semi-honest adversary model to the malicious adversary model. The algorithm is efficient: computing the intersection of two sets of one million elements each takes only 41 seconds (80-bit security) or 339 seconds (256-bit security) in parallel mode on moderate hardware. "Quantum private set intersection cardinality based on bloom filter" additionally proposes a Bloom-filter-based PSI algorithm that resists quantum attacks.

The hardware configuration used in "When Private Set Intersection Meets Big Data: An Efficient and Scalable Protocol" is as follows. The server is a Mac Pro with two Intel E5645 6-core 2.4 GHz CPUs and 32 GB RAM, running Mac OS X 10.8. The client is a MacBook Pro laptop with an Intel 2720QM quad-core 2.2 GHz CPU and 16 GB RAM, running Mac OS X 10.7. The two computers are connected via 1000 Mbps Ethernet.
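The garbled Bloom filter (GBF) mentioned above can be sketched as follows: instead of bits, each slot holds a random-looking string, and the strings at an element's hash positions XOR to an encoding of that element. This is a simplified illustration of the data structure only. The parameters, the helper names, and the choice to encode an element by a truncated hash are assumptions, and the OT step that delivers slots to the client is omitted.

```python
import hashlib
import secrets

SHARE_BYTES = 16  # length of each slot (illustrative security parameter)

def encode(item: str) -> bytes:
    """Fixed-length encoding of an element (assumption: a truncated hash)."""
    return hashlib.sha256(b"elem:" + item.encode()).digest()[:SHARE_BYTES]

def positions(item: str, num_slots: int, num_hashes: int):
    """Distinct slot indices for an item."""
    pos = []
    for i in range(num_hashes):
        digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
        p = int.from_bytes(digest[:8], "big") % num_slots
        if p not in pos:
            pos.append(p)
    return pos

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def build_gbf(items, num_slots, num_hashes):
    slots = [None] * num_slots
    for item in items:
        final_share = encode(item)
        reserved = None
        for p in positions(item, num_slots, num_hashes):
            if slots[p] is None:
                if reserved is None:
                    reserved = p  # keep one free slot for the final share
                    continue
                slots[p] = secrets.token_bytes(SHARE_BYTES)
            final_share = xor(final_share, slots[p])
        if reserved is None:
            raise ValueError("no free slot left; increase num_slots")
        slots[reserved] = final_share
    # Fill unused slots with randomness so they look like shares.
    return [s if s is not None else secrets.token_bytes(SHARE_BYTES) for s in slots]

def query_gbf(gbf, item, num_hashes):
    acc = bytes(SHARE_BYTES)
    for p in positions(item, len(gbf), num_hashes):
        acc = xor(acc, gbf[p])
    return acc == encode(item)

server_set = [f"item{i}" for i in range(50)]
gbf = build_gbf(server_set, num_slots=4096, num_hashes=8)
print(query_gbf(gbf, "item7", 8), query_gbf(gbf, "not-in-set", 8))
```

Because a query succeeds only if all shares of an element are available, handing the client just the slots selected by its own Bloom filter reveals nothing useful about elements outside the intersection.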

GC-based PSI Algorithm

The GC-based PSI algorithm was first proposed in "Private Set Intersection: Are Garbled Circuits Better than Custom Protocols?" in 2012. The scheme is based on the semi-honest adversary model. The main idea of the GC-based PSI algorithm proposed in this paper is that each party sorts its set locally and the parties privately merge their sorted sets into one sorted list. Each adjacent pair of elements is then compared obliviously: if the two elements are equal, the value is kept, otherwise it is replaced with a dummy value. Finally, the resulting list of matched elements is obliviously shuffled before the entire list is revealed. The shuffling step is necessary because otherwise the positions of the matched elements would leak information about the unmatched elements in the parties' sets. In this process, the GC machinery can be treated as a black box.
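The sort, compare, and shuffle steps described above can be written out in the clear to show what the circuit computes. This sketch runs on plaintext and provides no privacy; in the actual protocol every step is evaluated inside a garbled circuit. The function name and sample sets are assumptions.

```python
import heapq
import random

DUMMY = None  # placeholder that replaces non-matching positions

def sort_compare_shuffle(set_a, set_b, rng=random):
    """Plaintext model of the sort-compare-shuffle logic."""
    # 1. Each party sorts locally; the two sorted lists are merged.
    merged = list(heapq.merge(sorted(set_a), sorted(set_b)))
    # 2. Compare each adjacent pair: keep the value on a match, else a dummy.
    #    Sets contain no duplicates, so equal neighbours come from both parties.
    out = [x if x == y else DUMMY for x, y in zip(merged, merged[1:])]
    # 3. Shuffle so that positions reveal nothing about unmatched elements.
    rng.shuffle(out)
    return out

a = {3, 8, 15, 21, 42}
b = {8, 9, 21, 30, 42}
out = sort_compare_shuffle(a, b, random.Random(7))
revealed = {v for v in out if v is not DUMMY}
print(revealed)
```

The output list always has length $2n - 1$ for two sets of size $n$, so its length leaks nothing; only the non-dummy values carry information.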

The complexity of a GC-based PSI algorithm depends on which comparison and shuffling steps it uses.

  • The Bitwise-AND (BWA) comparison algorithm is only applicable when the element domain is small.
  • The Pairwise-Compare (PWC) comparison protocol has worst-case complexity $\Theta(n^2)$, where $n$ is the input data set size.
  • The Sort-Compare-Shuffle algorithm with oblivious shuffling has complexity $\Theta(n \log n)$. The main idea of the algorithm is that the parties sort their sets locally and then (privately) merge their sorted sets into a single sorted list.

| Protocol | Number of Non-Free Gates |
| --- | --- |
| BWA | $2^p$ |
| PWC | $((2n-\hat{n})^2+\hat{n})(\sigma-1)/4$ |
| Sort-Compare-Shuffle-SORT | $2\sigma n\log(2n)+((3n-1)\sigma-n)+2\sigma n\log^2(2\hat{n})$ |
| Sort-Compare-Shuffle-HE | $2\sigma n\log(2n)+((3n-1)\sigma-n)+(\sigma+32)n$ |
| Sort-Compare-Shuffle-WN | $2\sigma n\log(2n)+((3n-1)\sigma-n)+\frac{\sigma(n\log n-n+1)}{3}$ |
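To get a feel for these gate counts, the PWC and Sort-Compare-Shuffle-HE formulas can be evaluated for a concrete input. The sketch below assumes base-2 logarithms and treats $\hat{n}$ as a given parameter; the sample values ($n = 1024$, $\hat{n} = 512$, $\sigma = 32$) are arbitrary.

```python
import math

def pwc_gates(n: int, n_hat: int, sigma: int) -> float:
    """Non-free gates for Pairwise-Compare, per the formula in the table."""
    return ((2 * n - n_hat) ** 2 + n_hat) * (sigma - 1) / 4

def scs_common_gates(n: int, sigma: int) -> float:
    """Sort and compare terms shared by all Sort-Compare-Shuffle variants."""
    return 2 * sigma * n * math.log2(2 * n) + ((3 * n - 1) * sigma - n)

def scs_he_gates(n: int, sigma: int) -> float:
    """Non-free gates for Sort-Compare-Shuffle-HE, per the formula in the table."""
    return scs_common_gates(n, sigma) + (sigma + 32) * n

n, n_hat, sigma = 1024, 512, 32
pwc = pwc_gates(n, n_hat, sigma)
scs_he = scs_he_gates(n, sigma)
print(f"PWC:    {pwc:,.0f}")     # 18,288,512
print(f"SCS-HE: {scs_he:,.0f}")  # 883,680
```

Even at this modest size the quadratic PWC circuit needs roughly twenty times as many non-free gates as Sort-Compare-Shuffle-HE, and the gap widens as $n$ grows.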

In a garbled circuit, the computation has to be expressed as a series of gates, which are either free gates or non-free gates. A free gate can be evaluated directly on the garbled values without any decryption, so its cost is extremely low. A non-free gate is more expensive to compute. The more non-free gates a circuit contains, the higher its computation cost, so the number of non-free gates is the usual measure of the efficiency and speed of a GC protocol, and reducing that number is the main way to optimize it.
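As a small worked example of counting gates, consider an equality test on two $\sigma$-bit values. The sketch assumes the common free-XOR convention, in which XOR and NOT gates are free and AND gates are non-free; the text above does not name this convention, so treat it as an assumption. The function evaluates the circuit on plaintext bits and tallies the gates by type.

```python
def equality_circuit(x: int, y: int, sigma: int):
    """Evaluate a sigma-bit equality test and count gates by type."""
    free = nonfree = 0
    # XNOR each bit pair: built from XOR and NOT, counted as one free gate.
    same = []
    for i in range(sigma):
        same.append(1 ^ ((x >> i) & 1) ^ ((y >> i) & 1))
        free += 1
    # AND the sigma per-bit results together: sigma - 1 non-free gates.
    result = same[0]
    for bit in same[1:]:
        result &= bit
        nonfree += 1
    return bool(result), free, nonfree

equal, free_gates, nonfree_gates = equality_circuit(0xCAFE, 0xCAFE, 32)
print(equal, free_gates, nonfree_gates)  # True 32 31
```

Only the 31 AND gates would contribute to the cost in this model, which is why shortening the compared values directly lowers the non-free gate count.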

To reduce the large communication, computation, and memory overhead of GC-based PSI, the following improvements have been proposed:

  1. Hash-based improvements: This type of algorithm uses a hash function to map the elements of a set to fixed-length hash values and compares only the hash values, which reduces communication and computation costs. For example, the 2015 paper "Phasing: Private Set Intersection using Permutation-based Hashing" improves on the Sort-Compare-Shuffle process described above by applying Phasing (permutation-based hashing, which reduces the bit length of the element representation). This lowers the number of non-free gates and the circuit depth, making it more than 5 times faster than the Sort-Compare-Shuffle algorithm. In 2018, "Efficient Circuit-Based PSI via Cuckoo Hashing" optimized the cuckoo hashing algorithm and extended the participants to multiple parties.

  2. OT-based improvements: In 2019, Benny Pinkas et al. proposed a GC-based PSI algorithm with linear communication complexity in "Efficient Circuit-based PSI with Linear Communication". The algorithm builds on a protocol for computing oblivious programmable pseudorandom functions (OPPRFs) and amortizes the cost of multiple OPRF invocations so that the communication cost is linear.
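The bit-length reduction behind permutation-based hashing can be sketched as follows: an element is split into a left part, which selects the bin together with a hash of the right part, and a right part, which is the only thing stored. The function names, the SHA-256 based `f`, and the parameters (32-bit elements, $2^{10}$ bins) are assumptions for illustration.

```python
import hashlib

def f(x_right: int, bin_bits: int) -> int:
    """Hash of the right part into the bin-index range (illustrative choice)."""
    digest = hashlib.sha256(x_right.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:4], "big") % (1 << bin_bits)

def phase(x: int, sigma: int, bin_bits: int):
    """Map a sigma-bit element to (bin index, shortened stored value)."""
    right_bits = sigma - bin_bits
    x_left = x >> right_bits
    x_right = x & ((1 << right_bits) - 1)
    return x_left ^ f(x_right, bin_bits), x_right

def unphase(bin_index: int, x_right: int, sigma: int, bin_bits: int) -> int:
    """Recover the original element from its bin index and stored value."""
    x_left = bin_index ^ f(x_right, bin_bits)
    return (x_left << (sigma - bin_bits)) | x_right

SIGMA, BIN_BITS = 32, 10
x = 0xDEADBEEF
bin_index, stored = phase(x, SIGMA, BIN_BITS)
print(bin_index, stored.bit_length(), unphase(bin_index, stored, SIGMA, BIN_BITS) == x)
```

Within a bin, the shortened value still identifies the element uniquely, so comparisons inside the circuit can run on 22 bits instead of 32 in this example.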

PSI Algorithm Based on Public Key Encryption

DH-based PSI algorithm

The DH-based PSI algorithm is the best PSI protocol when communication is limited; see "Enhancing privacy and trust in electronic communities" (1999). In this algorithm, both parties compute a shared key value for each private element in their sets and hash it, and the receiver compares the hashes to obtain the intersection. "Private set intersection with ECDH" (2020) maps each element of the private set to a point on an elliptic curve and then computes the corresponding shared key hash to obtain the intersection. Because ECDH works over elliptic curves, it needs smaller private keys to reach the same level of security.
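The double-exponentiation idea behind DH-based PSI can be sketched with modular arithmetic instead of an elliptic curve. The toy modulus `P`, the hash-to-group mapping, and the function names are assumptions for illustration only; both parties run inside one function, the final hashing of the shared values is omitted, and a real deployment would use a proper prime-order group or curve.

```python
import hashlib
import secrets

P = 2**127 - 1  # toy prime modulus (assumption: illustration only, not secure)

def hash_to_group(item: str) -> int:
    """Map an item to a nonzero residue modulo P (illustrative mapping)."""
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def blind(values, key):
    """Raise every group element to the secret exponent `key`."""
    return [pow(v, key, P) for v in values]

def dh_psi(receiver_set, sender_set):
    a = secrets.randbelow(P - 2) + 1  # receiver's secret exponent
    b = secrets.randbelow(P - 2) + 1  # sender's secret exponent
    receiver_items = sorted(receiver_set)
    # Receiver -> sender: H(x)^a for each of its items.
    once = blind([hash_to_group(x) for x in receiver_items], a)
    # Sender -> receiver: (H(x)^a)^b in the same order, plus H(y)^b for its own items.
    twice = blind(once, b)
    sender_once = blind([hash_to_group(y) for y in sender_set], b)
    # Receiver: compute (H(y)^b)^a and match against H(x)^(ab).
    sender_twice = set(blind(sender_once, a))
    return {x for x, t in zip(receiver_items, twice) if t in sender_twice}

result = dh_psi({"alice", "bob", "carol"}, {"bob", "carol", "dave"})
print(sorted(result))  # ['bob', 'carol']
```

Each side only ever sees the other's elements raised to a secret exponent, and exponentiation commutes, so equal elements end up with equal doubly blinded values.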

PSI Algorithm Based on RSA Blind Signature

The PSI algorithm based on RSA blind signatures was proposed in "Practical Private Set Intersection Protocols with Linear Computational and Bandwidth Complexity" in 2009. Building on the DH-PSI algorithm described above, it blinds and signs the private elements through the blind signature mechanism in order to compute the private set intersection.
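A minimal sketch of the blind-signature flow is shown below. The client blinds the hash of each element, the server signs it with its RSA private key, and the client unblinds the result and compares a hash of the signature with tags the server published for its own elements. The tiny RSA modulus built from two known Mersenne primes, the helper names, and the single-process layout are assumptions for illustration and offer no real security.

```python
import hashlib
import math
import secrets

# Toy RSA key from two known Mersenne primes (assumption: far too small for real use).
P_PRIME, Q_PRIME = 2**61 - 1, 2**89 - 1
N = P_PRIME * Q_PRIME
E = 65537
D = pow(E, -1, (P_PRIME - 1) * (Q_PRIME - 1))  # server's private exponent

def full_domain_hash(item: str) -> int:
    return int.from_bytes(hashlib.sha256(item.encode()).digest(), "big") % N

def tag(signature: int) -> bytes:
    """Second hash applied to a signature before comparison."""
    return hashlib.sha256(signature.to_bytes(32, "big")).digest()

def rsa_blind_psi(client_set, server_set):
    # Server: sign its own elements and publish the tags.
    server_tags = {tag(pow(full_domain_hash(y), D, N)) for y in server_set}
    result = set()
    for x in client_set:
        # Client: blind H(x) with a random r coprime to N.
        r = secrets.randbelow(N - 2) + 2
        while math.gcd(r, N) != 1:
            r = secrets.randbelow(N - 2) + 2
        blinded = full_domain_hash(x) * pow(r, E, N) % N
        # Server: sign the blinded value without learning H(x).
        signed = pow(blinded, D, N)
        # Client: unblind to obtain H(x)^d and compare tags.
        signature = signed * pow(r, -1, N) % N
        if tag(signature) in server_tags:
            result.add(x)
    return result

result = rsa_blind_psi({"alice", "bob", "carol"}, {"bob", "carol", "dave"})
print(sorted(result))  # ['bob', 'carol']
```

Unblinding works because $(H(x) \cdot r^e)^d = H(x)^d \cdot r \pmod{N}$, so dividing by $r$ leaves the signature on $H(x)$, which the client could not have computed on its own.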

PSI Algorithm Based on Homomorphic Encryption

Homomorphic encryption is an encryption technique that allows computation on ciphertexts without decrypting them. These algorithms encrypt the data homomorphically and perform the set intersection operation in the encrypted domain. Representative homomorphic encryption schemes used for PSI include Paillier, BGV, and BFV.

  1. Homomorphic encryption was first used for PSI in "Efficient Private Matching and Set Intersection" (2004). The paper proposes a PSI protocol based on homomorphic encryption and balanced hashing, and it also covers the two-party malicious adversary model. In the semi-honest protocol, a set is represented by a polynomial $P$ whose roots are its items. For each item $y$ of the other party, the encrypted value of $P(y)$ is computed homomorphically, a random mask $r$ is generated, and the encrypted value $Enc(r \cdot P(y) + y)$ is obtained through homomorphic operations. In the malicious security model, the client and the server additionally verify the $k_c$ and $k_s$ roots, respectively, to defend against a malicious client or server.

  2. In 2010, " Efficient Set Operations in the Presence of Malicious Adversaries " proposed a scheme based on homomorphic encryption and zero-knowledge proof, which can guarantee the privacy of the set and realize efficient set operations even in the presence of malicious attackers.

  3. "Fast Private Set Intersection from Homomorphic Encryption" (2017) proposes an efficient private set intersection algorithm based on fully homomorphic encryption, which encrypts the set elements and uses homomorphic addition and homomorphic multiplication to compute the intersection. The algorithm optimizes the homomorphic computation by combining batching with cuckoo hashing and permutation-based hashing. In addition, windowing and partitioning techniques reduce the circuit depth of the homomorphic evaluation from the original $O(\log N_x)$ to:

    where $N_x$ is the size of data set $X$, $B$ is the size of the data set after partitioning, and $\alpha$ is the subset size after circuit partitioning.

  4. "Outsourced Private Set Intersection Using Homomorphic Encryption" (2012) uses homomorphic encryption to securely outsource the set intersection computation to a third-party computing service provider. The algorithm has low computation and communication overhead, but it requires trust in the third-party provider, and data privacy and security issues must be addressed in the implementation.

  5. "Private set intersection with linear communication from general assumptions" (2019) proposes a private set intersection algorithm with linear communication based on general assumptions. It uses homomorphic encryption for encryption and computation and introduces new techniques that reduce the communication complexity to linear.

  6. "Labeled PSI from Fully Homomorphic Encryption with Malicious Security" (2018) and "Labeled PSI from Homomorphic Encryption with Reduced Computation and Communication" (2021) each implement a PSI algorithm based on leveled fully homomorphic encryption. Both build on labeled PSI (LPSI), which attaches a label to each data item and then encrypts with HE. LPSI can be applied to targeted price discrimination, key retrieval in mobile communication, and similar scenarios. Both use the leveled FHE schemes BFV or BGV (SEAL library). The second paper improves the SIMD packing technique, uses the Paterson-Stockmeyer algorithm to reduce the computational complexity, and modifies the windowing technique to reduce the communication cost, thereby improving overall efficiency.
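The polynomial approach from the first item of this list can be sketched with a toy Paillier implementation. The client encrypts the coefficients of a polynomial whose roots are its elements, and for each of its own elements $y$ the server homomorphically computes $Enc(r \cdot P(y) + y)$. The small fixed primes, the helper names, and the use of small integers as set elements are assumptions for illustration; balanced hashing and the malicious-security checks are omitted.

```python
import math
import secrets

# Toy Paillier key from two known Mersenne primes (assumption: not secure).
P_PRIME, Q_PRIME = 2**61 - 1, 2**89 - 1
N = P_PRIME * Q_PRIME
N2 = N * N
LAM = (P_PRIME - 1) * (Q_PRIME - 1) // math.gcd(P_PRIME - 1, Q_PRIME - 1)
MU = pow(LAM, -1, N)

def encrypt(m: int) -> int:
    r = secrets.randbelow(N - 1) + 1
    return (1 + (m % N) * N) % N2 * pow(r, N, N2) % N2

def decrypt(c: int) -> int:
    return (pow(c, LAM, N2) - 1) // N * MU % N

def add(c1: int, c2: int) -> int:      # Enc(m1) * Enc(m2) = Enc(m1 + m2)
    return c1 * c2 % N2

def scalar_mul(c: int, k: int) -> int:  # Enc(m)^k = Enc(k * m)
    return pow(c, k % N, N2)

def client_encrypt_polynomial(client_set):
    """Encrypt the coefficients of P(x) = prod (x - a), lowest degree first."""
    coeffs = [1]
    for a in client_set:
        new = [0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            new[i] = (new[i] - a * c) % N
            new[i + 1] = (new[i + 1] + c) % N
        coeffs = new
    return [encrypt(c) for c in coeffs]

def server_evaluate(enc_coeffs, server_set):
    """For each y, return Enc(r * P(y) + y) using Horner's rule on ciphertexts."""
    replies = []
    for y in server_set:
        acc = enc_coeffs[-1]
        for c in reversed(enc_coeffs[:-1]):
            acc = add(scalar_mul(acc, y), c)
        r = secrets.randbelow(N - 1) + 1
        replies.append(add(scalar_mul(acc, r), encrypt(y)))
    return replies

def client_decode(replies, client_set):
    """A reply decrypts to y when P(y) = 0, and to a random-looking value otherwise."""
    return {m for m in map(decrypt, replies) if m in client_set}

client_set, server_set = {3, 7, 11, 19}, {7, 8, 19, 42}
replies = server_evaluate(client_encrypt_polynomial(client_set), server_set)
print(sorted(client_decode(replies, client_set)))  # [7, 19]
```

In a real protocol the server would also shuffle its replies, and hashing elements into bins keeps the polynomial degree, and with it the cost of each evaluation, small.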

PSI Algorithm Based on Differential Privacy

Differential privacy was first proposed by Cynthia Dwork of Microsoft Research in "Differential Privacy" in 2006. The aim of that paper is to prevent malicious adversaries from inferring other private data from histograms and k-anonymization results, which would leak the privacy of the database.

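In 2012, "DJoin: Differentially Private Join Queries over Distributed Databases" applied differential privacy to PSI for the first time. The DJoin system proposed in that paper supports many SQL-style queries, including joins over databases maintained by different entities. The original SQL only needs to be rewritten in terms of DJoin's primitives: BN-PSI-CA (a differentially private form of private set intersection cardinality) and DCR (a multi-party combination operator that can aggregate noisy cardinalities without compounding the individual noise terms). In 2019, "Cheaper Private Set Intersection via Differentially Private Leakage" proposed a security model for differentially private leakage in malicious-secure 2PC and introduced two new, improved mechanisms for "differentially private histogram overestimates", which is the main technical challenge of differentially private PSI. In 2020, "Differentially Private Two-Party Set Operations" combined differential privacy, homomorphic encryption, and circuits so that the communication complexity reaches $O(m)$, where $m$ is the size of the smaller data set. In 2023, "Split, Count, and Share: A Differentially Private Set Intersection Cardinality Estimation Protocol" showed how to estimate the intersection cardinality through differentially private PSI. According to its test results, the algorithm is a good replacement for the traditional private set intersection cardinality (PSI-CA) protocol.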

The private set intersection cardinality (PSI-CA) protocol computes the number of private elements that appear in both private sets, that is, the size of the intersection of the two sets.
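The core idea of a differentially private intersection cardinality can be sketched with the Laplace mechanism. The example computes the true count in the clear and then adds noise. In an actual protocol the count would be produced by a PSI-CA computation and the noise added inside it, so this only illustrates the noise calibration. The sensitivity of 1, the function names, and the sample sets are assumptions.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample as the difference of two exponential samples."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dp_intersection_cardinality(set_a, set_b, epsilon: float, rng=None) -> float:
    """Intersection size plus Laplace noise with scale 1 / epsilon.

    Adding or removing one element changes the count by at most 1,
    so the sensitivity is taken to be 1.
    """
    rng = rng or random.Random()
    true_count = len(set_a & set_b)
    return true_count + laplace_noise(1.0 / epsilon, rng)

set_a = set(range(0, 100))
set_b = set(range(50, 150))  # true intersection size: 50

rng = random.Random(0)
samples = [dp_intersection_cardinality(set_a, set_b, 1.0, rng) for _ in range(20_000)]
mean = sum(samples) / len(samples)
print(f"one noisy answer: {samples[0]:.2f}, mean over many: {mean:.2f}")
```

A smaller `epsilon` gives stronger privacy but noisier answers, which is the trade-off between privacy protection and accuracy that the noise parameters control.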

Summary

In practical applications, communication complexity, computation complexity, and security requirements have to be balanced: (1) if the network is the bottleneck, a PSI algorithm based on public key encryption, which has low communication complexity, can be considered; (2) if computing resources are the bottleneck, a PSI algorithm based on hashing or OT can be considered; (3) for high-security application scenarios, PSI algorithms based on GC, homomorphic encryption, or differential privacy are recommended. When choosing a PSI algorithm, it is also necessary to consider whether the data sets of the two parties are of similar size. Many of the papers above assume that the two data sets have the same size (balanced PSI). In the unbalanced PSI scenario, the amount of communication and computation is generally determined by the larger data set.

At present, PSI has significant applications in relationship path discovery in social networks, botnet detection, testing of fully sequenced human genomes, proximity testing, cheater detection in online games, and intelligence gathering.


Origin blog.csdn.net/shuizhongmose/article/details/131539087