PyGrid: A peer-to-peer platform for private data science and federated learning


What if you could train a model on all the data in the world without that data ever leaving its owner's device, and while keeping it confidential?

PyGrid is a peer-to-peer platform for private data science and federated learning. With PyGrid, data owners can provide, monitor, and manage access to their own private data clusters. The data never leaves the data owner's server.

Data scientists can then use PyGrid to perform private statistical analysis on those private datasets, and even federated learning across datasets from multiple institutions.

  This blog post will cover:

1. The basic concepts behind PyGrid, such as federated learning and secure multi-party computation, plus the PySyft library and the PyGrid platform (PySyft is a private machine learning library that can be deployed using the PyGrid platform)

2. Several practical examples of using PyGrid for privacy-preserving analysis; these examples will help us understand PyGrid's architecture and how to apply it to real problems

3. OpenMined's 2020 PyGrid development roadmap

 

Federated learning

The first concept we need to understand is federated learning. Federated learning is a technique that allows AI models to learn from users' data without requiring users to give that data up. How does this work? The first step is to create an initial model. The data scientist then sends the model to the data owner's device (in this case, Joe's device).

 

Now Joe can update this model by training it on his dataset. After training, the updated model is returned to the AI company.


Now, AI Incorporated sends the updated AI model to another device, in this case Jane's. Jane updates the model by training it on her dataset. After the model has trained on Jane's data, Jane sends the updated weights back to the data scientist.


Now the model has learned something from both Joe's and Jane's data. We can repeat this process across many nodes, and we can even train the model on multiple nodes at the same time and average the resulting updates, which improves the model faster.
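The round just described can be sketched in a few lines of plain Python. This is a toy under heavy assumptions: the "model" is a single weight, local training is a couple of gradient steps on a squared-error loss, and Joe's and Jane's data are invented lists of numbers. It is a sketch of the idea, not PySyft code.

```python
def local_update(weight, data, lr=0.1):
    """Simulate on-device training: gradient steps of (weight - x)^2 per point."""
    for x in data:
        weight -= lr * 2 * (weight - x)
    return weight

def federated_round(global_weight, devices):
    """Send the model to every device, train locally, then average the updates."""
    updates = [local_update(global_weight, data) for data in devices]
    return sum(updates) / len(updates)

# Joe's and Jane's private data stay inside local_update; only the
# trained weights leave each "device" and get averaged.
joe_data, jane_data = [1.0, 1.2], [0.8, 1.0]
print(federated_round(0.0, [joe_data, jane_data]))
```

Averaging the per-device updates, rather than visiting devices one at a time, is what lets many nodes train in parallel as noted above.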

The main benefits of federated learning are: training data stays on the user's device (or the hospital's server), which increases the privacy of sensitive data; it reduces the legal liability of model owners (data scientists and companies); and it reduces the network bandwidth spent uploading large datasets. It is easy to think of potential use cases for this technology.


Secure multi-party computation

Secure multi-party computation (SMPC) is another way to encrypt data and share it across different devices. Unlike traditional encryption techniques, the main advantage of SMPC is that we can perform logical and arithmetic operations directly on the encrypted data. How can multi-party computation be used for mathematical operations?

This is a (very) simplified working example:

In this example, Andrew holds a number; in this case he is the owner of the number 5, his personal data. Andrew can anonymize his data by splitting his number into 2 (or more) different numbers. Here, he splits the number 5 into 2 and 3. He can then share this anonymized data with his friends Marianne and Bob.

Now no one actually knows the true value of Andrew's data; each person holds only a piece of it. None of them can perform any kind of operation on it without everyone's consent. Yet even though the number is encrypted between them, we can still perform computations. In this way, we can compute on users' data using encrypted values without ever revealing any sensitive information.
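A minimal sketch of this idea in plain Python follows. One assumption to flag: instead of Andrew's fixed split of 5 into 2 and 3, real additive secret sharing draws random shares modulo a public number, so that any single share reveals nothing about the secret. The second secret (10) is invented for illustration.

```python
import random

Q = 2**31 - 1  # public modulus; all shares live in the integers mod Q

def share(secret, n_parties=2):
    """Split a secret into additive shares, as Andrew splits his number 5."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    """Only by combining every share can the true value be recovered."""
    return sum(shares) % Q

# Andrew shares his number 5 between Marianne and Bob.
andrew = share(5)
# A second user shares the number 10 with the same two parties.
other = share(10)

# Each party adds the two shares it holds; no one ever sees 5 or 10,
# yet reconstructing the summed shares gives the true sum.
summed = [(a + b) % Q for a, b in zip(andrew, other)]
print(reconstruct(summed))  # 15
```

This is why SMPC supports arithmetic on encrypted data: addition can be done share-by-share, locally, by each party.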


After understanding these concepts, we can now explain PySyft and PyGrid.

PySyft library

PySyft is a Python library for secure and private deep learning. PySyft aims to provide privacy-preserving tools inside the major deep learning frameworks, such as PyTorch and TensorFlow. This way, data scientists can use these frameworks to apply privacy-preserving concepts when handling any kind of sensitive data, without having to become privacy experts themselves.

PyGrid platform

 

PyGrid aims to be a peer-to-peer platform for federated learning and data science built on the PySyft framework.

The architecture consists of two components: the gateway and the nodes. The gateway component works like a DNS, routing requests to the nodes that host the desired datasets.

Nodes are provided by data owners: they are private data clusters, managed and monitored by their owners. The data never leaves the data owner's server.

Data scientists can then use PyGrid to perform private statistical analysis on these datasets, or even federated learning across datasets from multiple institutions.

Below, we walk through each use case.

 

Use case 1: Private statistical analysis

Let's explore two workflows:

A data owner who wants to publish sensitive data on their node (in this case, a hospital's pediatric ward).

A data scientist who wants to find a specific dataset on the grid network and run some statistical analysis on it.

 

Data owner

Step 1: Import PySyft and dependencies

As the data owner, the first step is to import our dependencies.

In this case, we import syft and use the syft hook to override the standard torch module.

 

Step 2: Connect to your node

The next step is to connect to your own node. Note that node applications are already deployed in a given environment, and you need to know their addresses in advance. In this case, we connect to the hospital node.

Step 3: Prepare the data as tensors and add a short description

Now we need to prepare the dataset we want to publish on the hospital node. To make it clear what we are publishing, we should add a brief description explaining what the data means and how it is structured. In this use case, we are publishing the hospital's monthly birth records.

 

Step 4: Define access rules and permissions

Next, we need to define rules that control access to the data. In this case, we allow certain users (Bob, Anna, and Alice) full access to the actual values of this data.
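The intent of this step can be illustrated with a toy access-control table. The usernames come from the text; the function and its behavior are illustrative, not PyGrid's actual API.

```python
# Only the listed users may download raw values; everyone else is refused.
ALLOWED_USERS = {"Bob", "Anna", "Alice"}

def read_value(user, tensor):
    """Return the real data only if `user` is on the allow-list."""
    if user not in ALLOWED_USERS:
        raise PermissionError(f"{user} may not download this data")
    return tensor

print(read_value("Bob", [3.2, 3.4]))  # permitted: Bob sees the values
try:
    read_value("Eve", [3.2, 3.4])     # refused: Eve only ever gets an error
except PermissionError as err:
    print(err)
```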

 

Step 5: Add tags and labels to help data scientists find your dataset

To make our data discoverable through queries, we also need to add tags that identify and describe it. In this example, we add two tags: #February to identify the month, and #birth-records to identify the meaning of the data.

Step 6: Publish! You're done.

Now the data is ready to publish. Note that you must be authorized to publish private datasets on this node. In this example, we use Bob's credentials to publish the data on the node.

 

As the data owner, this is all we have to do!

 

Data scientist

 

Step 1: Import PySyft and dependencies

As the data scientist, we also need to import the syft library and override the torch module with the syft hook.

 

Step 2: Connect to the grid platform

Unlike the data owner, we don't know where the nodes and datasets are, so we first need to connect to the GridNetwork. The grid network's address is the address of the gateway component.

Step 3: What data are you looking for? Search the network.

Once connected to the grid network, we can search for the dataset tags we need. Perhaps you are looking for pneumonia X-rays or a hospital's birth records. In this example, we use the tags published earlier. The grid network returns a dictionary with node IDs as keys and data pointers as values.

Step 4: Create a reference to the data pointer

Next, we define a direct reference to the hospital's data pointer.

 

Step 5: Understand and explore the data

For any data scientist, understanding the data you are working with is crucial. Next, we can explore the data pointers to understand what they mean and how they are organized.

 

Wait, what if I try to copy the data? If we try to retrieve the real value of a data pointer without permission, an exception is thrown. In this way, PyGrid keeps the data in the owner's hands and gives the owner the control to allow or deny access to data samples.

 

Step 6: Perform computations

Even without copying the data, we can still perform remote computations on it. In this example, we want to calculate the average weight and height of the babies. To do this, we need to sum the column values remotely.


Now we can compute the weight sum remotely. It produces another remote tensor.

 

We can use the height column to perform the same operation.

 

Now we just need to use our credentials to retrieve the summed value, and then divide it by 5, the size of the dataset.

 

We can do the same to get the average height. In this way, we can calculate the average weight and height of the babies born this month without ever accessing any sensitive data.
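The computation above can be sketched in plain Python. The five (weight, height) rows are invented example values, and the `remote_sum` function is a local stand-in for what the node would compute; the point is that only two aggregate numbers cross the network, never raw rows.

```python
# Lives only on the hospital node; (weight_kg, height_cm), values invented.
birth_records = [
    (3.2, 49.0), (3.4, 51.0), (3.0, 48.5), (3.6, 52.0), (3.3, 50.0),
]

def remote_sum(rows, column):
    """Runs on the node; returns a single aggregate, never the raw rows."""
    return sum(r[column] for r in rows)

n = len(birth_records)  # the dataset size (5 in this example)
avg_weight = remote_sum(birth_records, 0) / n
avg_height = remote_sum(birth_records, 1) / n
print(avg_weight, avg_height)
```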

 

Done! We know the average height and weight of babies born in February without ever moving the dataset to our own server, so we never receive any private information about individual babies.

 

How do we manage data access?

In the near future, we will provide a simple interface for validating and managing tensor rules. As the administrator of your own grid node, you will be able to manage that node's accounts. As the data owner, you can identify and control who can access your node.


The grid will be able to allow or deny access by evaluating requests using different techniques.

 

Use case 2: Cross-silo federated learning

How can we use the PyGrid architecture to perform federated learning across institutions or devices? In this use case, we will train a model on MNIST using a federated learning approach. The process by which data owners populate their nodes with dataset samples is the same as in use case 1, so we will jump straight to the data scientist's workflow.

Federated learning as a data scientist

 

Step 1: Import PySyft and dependencies

Step 2: Define our model architecture

Now we need to define the model architecture, along with everything required to run the ML process on the neural network.

 

Step 3: Connect to the grid platform

As before, we need to connect to the grid network to run queries against the nodes.

 

Step 4: Search for the required dataset

In this example, we search for the MNIST dataset and its labels.

 

As we can see here, our grid network has some nodes hosting the MNIST dataset.

Step 5: Create references to the data pointers

Now we just need to obtain direct references so we can work with these pointers.

 

This will help us see how the federated learning algorithm works.

Step 6: Train the model!

In this training function, we iterate over the data pointers, find the corresponding worker, and train the model remotely. Two calls deserve explanation:

The first is model.send(worker): this sends a copy of our global model to the current worker so it can be trained on that worker's data.
The second is model.get(): once the model has been trained on the local data, we retrieve it to update the global model.

In this way, the federated learning process iterates over all the nodes hosting the MNIST dataset.
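The shape of that loop can be sketched without PySyft. Here the workers are plain dicts, the "model" is a single weight, and the real model.send / model.get calls are replaced by local comments; the worker names and data are invented.

```python
# Two mock workers holding private data (names and values are made up).
workers = {
    "alice": [1.0, 1.5],
    "bob": [2.0, 2.5],
}

def train_locally(weight, data, lr=0.05):
    """Stand-in for remote training on one worker's private data."""
    for x in data:
        weight -= lr * 2 * (weight - x)
    return weight

global_weight = 0.0
for name, data in workers.items():
    # model.send(worker): ship the current global model to this worker
    local_weight = global_weight
    # train remotely on the worker's private data
    local_weight = train_locally(local_weight, data)
    # model.get(): pull the trained model back to update the global copy
    global_weight = local_weight
print(global_weight)
```

Note the contrast with the averaging variant shown earlier: here each worker trains the model in sequence, carrying forward what the previous worker learned.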

Use case 3: Encrypted MLaaS

How can we host models and run inference in a secure and private way?

PyGrid's solution to this problem uses a data structure called a Plan together with a multi-party computation (MPC) protocol. A Plan is a data structure that defines and serializes a set of instructions to be executed remotely. Using a Plan, we can define a model structure that will run on remote devices. A Plan whose MPC pointers are distributed across different machines can then perform model inference in a secure way.

Step 1: Import PySyft and dependencies

As before, the first step is to import our dependencies. Note that we need to set hook.local_worker.is_client_worker to False. This makes the syft library store the Plan's metadata in its structure.

Step 2: Connect to the grid platform

Now, as before, we need to connect to the grid.

 

Step 3: Define the model

Here we define the model. Note that the model definition must extend Plan. For ease of explanation, we will use a linear model with a single layer and preset weight and bias values. That way, we can run predictions and make sense of the results.
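A plain-Python stand-in for such a one-layer linear model is below. The weight and bias values are made up for illustration (the original post does not state them); the point is that with fixed parameters the outputs are fully predictable, which makes the later encrypted-vs-plaintext comparison easy to check.

```python
# Preset parameters of the single linear layer (values are invented).
WEIGHT = 2.0
BIAS = 0.5

def linear_plan(x):
    """Forward pass of one linear layer: y = WEIGHT * x + BIAS, element-wise."""
    return [WEIGHT * v + BIAS for v in x]

print(linear_plan([1.0, 2.0, 3.0, 4.0]))  # [2.5, 4.5, 6.5, 8.5]
```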

 

Step 4: Define the input data

Our input data will be a one-dimensional tensor, so that we can easily interpret the final result.

 

Step 5: Initialize the model

Now we need to initialize the model. For comparison, we will also initialize an unencrypted (plaintext) copy of the model.

 

Step 6: Host the model on the grid network

Finally, we can host this model on the grid network. Note the MPC flag: it tells the syft library to use the MPC protocol to split the Plan's parameters across different devices on the network.

 

Here we can see where the Plan and its parameter shares live. The company's data cluster hosts the Plan structure, while the hospital and public data clusters host the MPC parameter shares. Finally, the university data cluster is our crypto provider, which allows us to perform MPC multiplications.

 

Step 7: Return the result

 

A lot is happening here at once, so let's break it down:

1- This function downloads the Plan structure from the company's data cluster and retrieves its remote pointer.

2- Next, the input_values are split into MPC shares and shared with the same devices that hold the Plan's MPC pointers.

3- With all the MPC pointers distributed across the devices, we can execute the Plan.

4- Finally, we can sum the MPC shares, which reconstructs the actual result.
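Why does this need a crypto provider at all? Additions on shares are local, but multiplying two secret-shared values (an input share times a weight share) requires extra randomness, classically a Beaver triple. The sketch below shows the standard Beaver-triple trick in plain Python; the "trusted dealer" inside `beaver_mul` plays the role the university crypto-provider node plays in the text. This is a textbook construction, not PySyft's implementation.

```python
import random

P = 2**31 - 1  # public modulus; all shares live in the integers mod P

def share(value, n=2):
    """Split `value` into n additive shares that sum to value mod P."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

def beaver_mul(x_shares, y_shares):
    """Multiply two secret-shared values using a Beaver triple (a, b, c = a*b)."""
    # The crypto provider deals a random triple, itself secret-shared.
    a, b = random.randrange(P), random.randrange(P)
    c = (a * b) % P
    a_sh, b_sh, c_sh = share(a), share(b), share(c)
    # The parties open d = x - a and e = y - b; these reveal nothing
    # about x and y because a and b are uniformly random masks.
    d = reconstruct([(x - ai) % P for x, ai in zip(x_shares, a_sh)])
    e = reconstruct([(y - bi) % P for y, bi in zip(y_shares, b_sh)])
    # Each party computes z_i = c_i + d*b_i + e*a_i; one party adds d*e.
    # Summing gives c + d*b + e*a + d*e = x*y.
    z_sh = [(ci + d * bi + e * ai) % P
            for ci, ai, bi in zip(c_sh, a_sh, b_sh)]
    z_sh[0] = (z_sh[0] + d * e) % P
    return z_sh

x_sh, w_sh = share(6), share(7)  # a shared input times a shared weight
print(reconstruct(beaver_mul(x_sh, w_sh)))  # 42
```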

 

Comparison with the plaintext model

For comparison, if we run the unencrypted model, we get the same result.

 

OpenMined's 2020 PyGrid roadmap

In 2020, the PyGrid team has four main goals:

Heterogeneous networks (syft.js, swift.js, mobile workers): First, create a standard way to send and receive messages across different platforms. Today, PyGrid is a server-based platform, meaning you need to set up nodes and provide their infrastructure. We intend to extend PyGrid's functionality to mobile devices.

Privacy budget: Establish a privacy budget to assess and control the level of data anonymization.

Automatic differential privacy tracking: This allows the privacy budgets of entities in a dataset to be tracked automatically over time.

In this way, we can formally bound the amount of information leaked when we release private assets (such as AI models). Ideally this infrastructure is automated as much as possible, but it should also allow applications (UIs) that let humans review digital assets for private information (which may be necessary for early adopters).

Data request queue: We will create a data request queue that lets data owners evaluate the data requests that control access to their data.

 


Origin blog.csdn.net/ruiyiin/article/details/105589898