In-Depth Understanding of Federated Learning: Private Set Intersection (PSI) Basics

Series index: "In-Depth Understanding of Federated Learning" table of contents


Private Set Intersection (PSI) is a key preprocessing step in vertical federated learning. Before multiple parties carry out a joint computation, it is used to find the data samples they have in common without exposing any party's unique samples.

Suppose two companies, A and B, want to jointly train a machine learning model that predicts whether a user is interested in technology products. Company A holds the purchase-history data of three users A, B, and C, while Company B holds the feed-article browsing data of three users B, C, and D. With vertical federated learning, and without either company leaking its own user data, the features that Company A and Company B each hold for the two shared users B and C can be combined to jointly train a prediction model. Because the model is trained on data from both parties, its results should in theory be more accurate than a model trained by Company A or Company B alone.

Since model training requires data from both Company A and Company B, and user A has data only at Company A, user A cannot be used as a training sample; likewise, Company B's user D cannot participate in training. Therefore, before vertical federated learning begins, the two parties need to compute their common samples, namely the two users B and C, and all subsequent computation revolves around B and C. Private Set Intersection (PSI) is the method by which the two parties obtain the intersection {B, C} through encrypted computation without exposing their original sets.
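Before worrying about privacy, it helps to see the functionality the two parties ultimately want to compute. The minimal Python sketch below (the user IDs are the illustrative ones from the example above) shows plain, non-private sample alignment; PSI has to produce the same result without either side revealing its full ID list.

```python
# Plain (non-private) sample alignment: the result PSI must reproduce securely.
party_a_ids = {"A", "B", "C"}   # Company A: users with purchase-history data
party_b_ids = {"B", "C", "D"}   # Company B: users with article-browsing data

# Shared users that can serve as training samples in vertical federated learning.
shared_users = party_a_ids & party_b_ids
print(shared_users)  # {'B', 'C'}
```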

Private set intersection means that the participating parties obtain the intersection of the data they hold without revealing any additional information. More precisely:

  • Multiple parties each hold their own private data
  • They wish to obtain the intersection of all their data via a protocol
  • No information other than the intersection is leaked

Here, "additional information" refers to anything other than the intersection of the parties' data. Private set intersection is very useful in real-world scenarios, for example sample alignment in vertical federated learning, or discovering friends via the address book in social apps. A secure and fast private set intersection algorithm is therefore very important.

A very intuitive way to compute a private set intersection is the naive hashing method: parties A and B agree on the same hash function H, each hashes its own data items, and they exchange the hashed values to find the intersection. This approach looks simple and fast, but it is insecure and can leak additional information. If the data being intersected comes from a small input space, such as phone numbers or ID numbers, a malicious participant can enumerate that space within a limited time, hash every candidate, and match the results against the hash values received from the other party, thereby recovering information beyond the intersection. We therefore need to design more secure private set intersection methods.
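The sketch below, assuming SHA-256 as the shared hash function and made-up phone numbers, illustrates both the naive hash exchange and why it fails: once the input space is small, a dictionary of all candidate hashes recovers the other party's entire set, not just the intersection.

```python
import hashlib

def h(x: str) -> str:
    """The shared hash function H (SHA-256 here, purely for illustration)."""
    return hashlib.sha256(x.encode()).hexdigest()

# Naive hash-based intersection: each side hashes its items and exchanges digests.
party_a = {"13800000001", "13800000002", "13800000003"}   # e.g. phone numbers
party_b = {"13800000002", "13800000003", "13800000004"}

digests_a = {h(x): x for x in party_a}
digests_b = {h(x) for x in party_b}
print({digests_a[d] for d in digests_a.keys() & digests_b})  # intended intersection

# The attack: with a small input space, a malicious receiver enumerates every
# candidate, hashes it, and looks the received digests up in the dictionary,
# recovering non-intersection elements as well.
def brute_force(received_digests, candidate_space):
    table = {h(c): c for c in candidate_space}
    return {table[d] for d in received_digests if d in table}

candidates = [f"138000000{i:02d}" for i in range(100)]     # tiny space for demo
print(brute_force(digests_b, candidates))                  # all of party B's data
```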

In theory there are many different ways to realize private set intersection, for example methods based on Diffie-Hellman key exchange and methods based on oblivious transfer (OT). The fastest PSI protocols known so far are based on oblivious transfer. Subsequent articles will introduce these PSI algorithms one by one.
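As a preview, here is a toy Python sketch of the Diffie-Hellman-style approach. The modulus, hashing, and two-round message flow are simplified assumptions for illustration only; a real implementation would use an elliptic-curve group, a proper hash-to-group mapping, and shuffling of the returned values.

```python
import hashlib
import secrets

# Toy DH-style PSI sketch. P is a Mersenne prime used only for illustration;
# it is NOT a vetted DH group and gives no real security guarantees.
P = 2**521 - 1

def h2int(x: str) -> int:
    """Hash an element into the multiplicative group modulo P."""
    return int.from_bytes(hashlib.sha256(x.encode()).digest(), "big") % P

def blind(items, key):
    """First blinding: H(x)^key mod P, applied to a party's own elements."""
    return [pow(h2int(x), key, P) for x in items]

def reblind(values, key):
    """Second blinding: raise already-blinded values to the other secret key."""
    return [pow(v, key, P) for v in values]

set_a = ["A", "B", "C"]                    # Company A's user IDs (example above)
set_b = ["B", "C", "D"]                    # Company B's user IDs

a_key = secrets.randbelow(P - 2) + 1       # A's secret exponent
b_key = secrets.randbelow(P - 2) + 1       # B's secret exponent

# Round 1: each party blinds its own set and sends it to the other party.
a_blinded = blind(set_a, a_key)            # A -> B: H(x)^a
b_blinded = blind(set_b, b_key)            # B -> A: H(y)^b

# Round 2: B re-blinds A's values and returns them in the same order;
# A re-blinds B's values locally. Exponentiation commutes, so shared
# elements end up with identical doubly-blinded values.
a_double = reblind(a_blinded, b_key)       # B -> A: H(x)^(a*b)
b_double = set(reblind(b_blinded, a_key))  # computed by A: H(y)^(a*b)

intersection = [x for x, v in zip(set_a, a_double) if v in b_double]
print(intersection)  # ['B', 'C'] -- only the shared users are revealed
```

The security of this style of protocol rests on the hardness of Diffie-Hellman-type problems in the chosen group, and every element costs a public-key operation; OT-based protocols replace most of this public-key work with fast symmetric-key operations, which is why they are currently the fastest.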



Origin blog.csdn.net/hy592070616/article/details/132815425