(A) Federated Learning - Getting Acquainted

Contents:
(a) Federated Learning - Getting Acquainted
(b) Federated Learning - FATE Standalone Deployment

1. Background

1.1 Development of Artificial Intelligence

In recent years, artificial intelligence has swept in wave after wave: from face recognition and liveness detection helping police crack criminal cases, to AlphaGo defeating top human Go players, to autonomous driving and precision marketing, AI has gradually entered every corner of daily life. Inevitably there has also been over-hyping, which breeds the misconception that AI is omnipotent: if it is so easy, why can't I just use it too? The chase after AI often misses the point that AI must be fed with data, and with large amounts of high-quality data at that.
In real life, apart from a few giant companies, the vast majority of enterprises suffer from scarce data and poor data quality, insufficient to support the realization of artificial intelligence technology.

1.2 Laws and regulations protecting data privacy

With the further development of big data, attention to data privacy and security has become a worldwide trend. Every public data breach draws intense attention from the media and the public; the recent Facebook data leak, for example, triggered large-scale protests. At the same time, countries are strengthening the protection of data security and privacy. The EU's recently introduced General Data Protection Regulation (GDPR) shows that ever-stricter management of user data privacy and security will be a global trend. This poses unprecedented challenges to artificial intelligence. A common situation in research and industry today is that the party collecting the data is not the party using it: party A collects the data and transfers it to party B for cleaning, B transfers it to party C for modeling, and finally the model is sold to party D. This transfer, exchange, and trading of data between entities violates the GDPR and may be severely punished under it. Similarly, China's Cybersecurity Law of the People's Republic of China and General Principles of the Civil Law of the People's Republic of China, in force since 2017, stipulate that network operators must not leak, tamper with, or destroy the personal information they collect, and that when trading data with third parties, the contract must expressly specify the scope of the data involved and the data-protection obligations. To varying degrees, these regulations pose new challenges to the traditional data-processing model of artificial intelligence.

1.3 The data-silo problem

With the introduction of bills stressing data security and privacy protection, the extensive data sharing of the past is being challenged: individual data sources are retreating into silos held by their owners, and it is also becoming harder for Internet companies to collect and exploit users' private data.
The data-silo phenomenon will not disappear; it will become the new normal, existing not only between different companies and organizations but also between departments within large groups. In the future we must face this reality: if we want to make better use of data and do more meaningful things with big data and AI, data must be shared between companies and between organizations, but only on the premise that the sharing satisfies privacy protection and data security.
Privacy leaks and data abuse hang like the sword of Damocles over companies and organizations, so breaking down data silos has become one of the most pressing problems for the AI industry.

1.4 The birth of federated learning

To overcome these difficulties, traditional methods have hit a bottleneck. How to design a machine-learning framework that satisfies data-privacy, security, and regulatory requirements while allowing AI systems to use each party's data jointly, efficiently, and accurately is an important topic in AI development. We propose shifting the focus to solving the data-silo problem, with a feasible solution that satisfies privacy protection and data security: federated learning.
Federated learning means that:

  • Each party keeps its data local, so privacy is not leaked and regulations are not violated;

  • Multiple participants jointly build a virtual shared model and benefit from the system together;

  • Within the federated learning system, every participant has equal identity and status;

  • The modeling effect of federated learning is the same as, or differs little from, the effect of a model trained on the pooled data (under the condition that the data is aligned by user (user alignment) or by feature (feature alignment));

  • Even when users and features are not aligned, knowledge can still be transferred between the parties' data through the exchange of encrypted parameters, achieving a transfer-learning effect.

In short, federated learning lets two or more parties use each other's data cooperatively to solve problems while no entity's data ever leaves its local environment, resolving the impasse between data silos.

2. Definitions

2.1 Overview

Federated learning aims to let each enterprise's own data stay local while a virtual shared model is built by exchanging parameters under an encryption mechanism within the federated system, i.e., without violating data-privacy regulations. The virtual model behaves as if it had been built on the aggregated data; but when it is built, the data itself never moves, so there is no data leakage and no compliance risk. The model built this way serves only local objectives in each party's own region. Under such a federated mechanism, every participant has the same identity and status, and the federated system helps everyone establish a strategy of common prosperity. That is why the system is called federated learning.

2.2 Definition

To state the idea of federated learning precisely, we define it as follows. Suppose N data owners (e.g., enterprises) F_i, i = 1...N each hold a dataset D_i and wish to train a machine-learning model jointly. The traditional approach is to pool the data into one set D = {D_i, i = 1...N} and train a model M_SUM on it. This solution is often hard to implement because of privacy and data-security regulations. Federated learning instead lets the data owners F_i obtain a trained model M_FED without any party exposing its own data D_i, while guaranteeing that the gap between the effect V_FED of M_FED and the effect V_SUM of M_SUM is small enough, that is:

|V_FED - V_SUM| < δ, where δ is an arbitrarily small positive value.

2.3 Classification

We classify federated learning by how the data in the silos is distributed. Consider multiple data owners, where each party's dataset D_i can be represented as a matrix: each row of the matrix represents a user, and each column a user feature. Some datasets may also contain label data; to build a predictive model of user behavior, label data is required. We call the user features X and the label Y. For example, in finance the label Y to be predicted is the user's credit; in marketing it is the user's purchase intent; in education it is the student's mastery of knowledge. The user features X together with the label Y form a complete training sample (X, Y). In reality, however, the parties' datasets often have neither identical users nor identical features. Taking federated learning between two data owners as an example, the data distribution falls into three cases:

  • the two datasets' user features (X1, X2, ...) overlap heavily, while their users (U1, U2, ...) overlap little;
  • the two datasets' users (U1, U2, ...) overlap heavily, while their user features (X1, X2, ...) overlap little;
  • the two datasets' users (U1, U2, ...) and user features (X1, X2, ...) both overlap little.


2.3.1 Horizontal federated learning

When the two datasets' user features overlap heavily but their users overlap little, we split the data horizontally (i.e., along the user dimension) and take out, for training, the portion where the features are identical but the users are not. This approach is called horizontal federated learning. For example, two banks in different regions draw their user groups from their respective areas, so the intersection of users is very small; but their businesses are very similar, so the recorded user features are the same. In this case, we can use horizontal federated learning to build a joint model. In 2017, Google proposed such a joint modeling scheme for updating models on Android phones: each user's phone updates the model parameters locally and uploads them to the Android cloud, so that data owners sharing the same feature dimensions jointly train a model — a form of federated learning.

step 1: each participant downloads the latest model from server A;
step 2: each participant trains the model on its local data and uploads encrypted gradients to server A; server A aggregates the participants' gradients to update the model parameters;
step 3: server A returns the updated model to each participant;
step 4: each participant updates its own model.

Interpretation of the steps: In conventional machine learning, the data needed for training is usually gathered in a data center, where the model is trained and then used for prediction. Horizontal federated learning can instead be seen as sample-based distributed model training: the data stays distributed across different machines; each machine downloads the model from the server, trains it on its local data, and returns the parameters the server needs for the update; the server aggregates the parameters returned by the machines, updates the model, and feeds the latest model back to every machine.

Throughout this process each machine holds the same, complete model, and the machines neither depend on nor communicate with each other; at prediction time each machine can also predict independently. Google originally used this horizontal federated solution to let end users update models locally on their Android phones.
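The four steps above can be sketched in a few lines. This is a minimal plaintext simulation in the spirit of FedAvg; a real system would encrypt the uploaded updates, and the linear model, data, and learning rate here are invented purely for illustration:

```python
# Minimal plaintext sketch of the four steps above, in the spirit of
# FedAvg (horizontal federated averaging). A real deployment would
# encrypt the uploaded updates; the model, data, and learning rate
# here are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=5):
    """Step 2: a party trains the downloaded model on its local data."""
    w = weights.copy()
    for _ in range(epochs):
        w -= lr * X.T @ (X @ w - y) / len(y)   # least-squares gradient step
    return w

def fed_avg(party_weights, party_sizes):
    """Server A aggregates: average of updates, weighted by data size."""
    total = sum(party_sizes)
    return sum(w * (n / total) for w, n in zip(party_weights, party_sizes))

# Two parties hold different samples of the same task y = 2*x.
true_w = np.array([2.0])
X1, X2 = rng.normal(size=(50, 1)), rng.normal(size=(80, 1))
parties = [(X1, X1 @ true_w), (X2, X2 @ true_w)]

global_w = np.zeros(1)
for _ in range(20):                      # steps 1-4, repeated every round
    updates = [local_update(global_w, X, y) for X, y in parties]
    global_w = fed_avg(updates, [len(y) for _, y in parties])

print(global_w)                          # converges toward [2.0]
```

Note that the server only ever sees parameter updates, never the parties' raw samples, which is the whole point of the scheme.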

2.3.2 Vertical federated learning

When the two datasets' users overlap heavily but their user features overlap little, we split the data vertically (i.e., along the feature dimension) and take out, for training, the portion where the users are identical but the features are not. This approach is called vertical federated learning. For example, consider two institutions in the same place, one a bank and the other an e-commerce company. Their user bases likely contain most residents of the area, so the intersection of users is large. But the bank records the user's income, expenditure, and credit rating, while the e-commerce company keeps the user's browsing and purchase history, so the intersection of features is small. Vertical federated learning aggregates these different features in an encrypted state to enhance the capability of the model. At present, many machine-learning models, including logistic regression, tree models, and neural networks, have been shown to work step by step on this federated system. Training is divided into two main steps:

Step 1: encrypted sample alignment. This is done at the system level, so neither company's non-overlapping users are exposed at the business level.

Step 2: encrypted model training on the aligned samples:

step 1: the third party C sends a public key to A and B, used to encrypt the data to be transmitted;
step 2: A and B each compute the intermediate results for their own features and exchange them in encrypted form, in order to obtain their respective gradients and losses;
step 3: A and B each compute their encrypted gradients, add masks, and send them to C; B also computes the encrypted loss and sends it to C;
step 4: C decrypts the gradients and the loss and sends them back to A and B; A and B remove the masks and update their models.

Throughout the process, neither party learns the features of the other's data, and at the end of training each participant obtains only the model parameters for its own side, i.e., half the model.
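The data flow of the training steps above can be sketched as follows. This plaintext toy deliberately omits the third party C, the public-key encryption, and the masks, and uses an invented linear-regression task, purely to show which intermediate results each side contributes:

```python
# Plaintext sketch of the vertical-FL data flow for a toy linear model.
# The real protocol encrypts every exchanged quantity with C's public
# key (e.g. Paillier) and adds masks; here exchanges are in the clear
# purely to show which intermediate results each side contributes.
import numpy as np

rng = np.random.default_rng(1)
n = 200
X_A = rng.normal(size=(n, 2))            # party A's features
X_B = rng.normal(size=(n, 3))            # party B's features; B also holds y
y = X_A @ np.array([1.0, -2.0]) + X_B @ np.array([0.5, 0.0, 3.0])

w_A, w_B, lr = np.zeros(2), np.zeros(3), 0.1
for _ in range(300):
    # step 2: each side computes its partial prediction and "sends" it over
    u_A, u_B = X_A @ w_A, X_B @ w_B      # would be encrypted in reality
    residual = u_A + u_B - y             # only computable from both parts
    # steps 3-4: each side updates its own half-model from the residual
    w_A -= lr * X_A.T @ residual / n
    w_B -= lr * X_B.T @ residual / n

print(np.round(w_A, 2), np.round(w_B, 2))   # ≈ [1, -2] and [0.5, 0, 3]
```

Each party ends up holding only its own half of the model, mirroring the "half the model" outcome described above.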

Prediction:
Because each party holds only the model parameters relevant to itself, the two sides must collaborate to make a prediction.
The results of joint modeling:

  • both parties' data remains protected;
  • the model effect is improved jointly;
  • the model is lossless.

2.3.3 Federated transfer learning

When the two datasets' users and user features both overlap little, we do not segment the data; instead, we can use transfer learning to overcome the shortage of data or labels. This approach is called federated transfer learning. For example, consider two institutions, a bank located in China and an e-commerce company in the United States. Because of geography, the intersection of their user groups is very small; and because the institutions are of different types, only a small portion of their data features overlap. In this case, for federated learning to be effective, transfer learning must be introduced to address the small scale of each side's data and the scarcity of labeled samples, and thereby improve the model.

3. Application scenarios

3.1 Smart finance

As a modeling method that safeguards data security, federated learning has enormous application prospects in industries such as retail and finance. In these industries, factors including intellectual property, privacy protection, and data security mean that data cannot simply be aggregated for machine-learning training; instead, a joint model must be trained with federated learning.
Take smart retail as an example. Its goal is to use machine learning to deliver personalized products and services to users, mainly product recommendation and sales services. The data features involved in smart retail comprise three parts: the user's purchasing power, the user's personal preferences, and product characteristics. In practice, however, these three kinds of features are very likely scattered across three different departments or enterprises: a bank holds the features describing the user's purchasing power, a social-networking site holds the user's personal preferences, and a shopping site holds the product characteristics. In this situation we face two major problems. First, for reasons of user privacy and enterprise data security, the data barriers between the bank, the social-networking site, and the shopping site are hard to break, so the smart-retail business cannot simply aggregate the data and model on it. Second, the users and user-feature data of the three parties are usually heterogeneous, and traditional machine-learning models cannot learn directly on heterogeneous data. These problems have not been effectively solved by traditional machine-learning methods, and they hinder the spread of AI into more areas of society.
Federated learning is precisely the key to solving them. Imagine that in the smart-retail scenario we use federated learning together with transfer learning to model the three parties' data jointly. First, thanks to federated learning, we can build a joint machine-learning model for the three parties without exporting any enterprise's data, which fully protects user privacy and data security while providing users with personalized, targeted services, so that all parties benefit. At the same time, we can borrow ideas from transfer learning to deal with the heterogeneity of the users and user-feature data: transfer learning can mine and exploit the knowledge the datasets have in common, breaking through the limits of traditional AI techniques. Federated learning thus provides solid technical support for building a cross-enterprise, cross-data, cross-domain big-data AI ecosystem.

3.2 Smart healthcare

Smart healthcare is also becoming a hot field for combination with artificial intelligence, yet its current level falls far short of truly "smart". Below, we use IBM Watson as an example to discuss the present shortcomings of smart healthcare, and propose using federated transfer learning to raise its level.
IBM's supercomputer Watson is one of the best-known applications of AI in medicine. Watson is used by medical institutions in several countries, including China and the United States, for automated diagnosis, focusing on confirming various cancers and offering medical advice. Yet Watson has faced constant outside doubt: a recently disclosed document showed that in a simulated training session Watson once wrongly prescribed a drug that could have killed the patient, a serious blow to the Watson health project. Why would Watson misdiagnose? Its training data should include features such as symptoms, gene sequences, pathology reports, test results, and medical papers. In practice, however, the sources of such data fall far short, and a large amount of the data lacks labels. One estimate holds that labeling medical data through a third-party company would require 10,000 people working for as long as 10 years to collect valid data. The shortage of data and the lack of labels lead to unsatisfactory training of machine-learning models, and this has become the bottleneck of today's smart healthcare.
How can this bottleneck be broken? Suppose all medical institutions united and each contributed its own portion of data; the pooled dataset would be large enough, and the training of the corresponding machine-learning models could make a qualitative leap. The main route to this vision is federated learning combined with transfer learning, for two reasons. First, each institution's data is necessarily highly private, so direct data exchange is not feasible, while federated learning can train models without exchanging data. Second, the data still suffers from severely missing labels, and transfer learning can be used to complete the labels, enlarging the usable data and further improving the model. Federated transfer learning is therefore bound to play a vital role in the development of smart healthcare; if in the future all medical institutions could form a federated transfer-learning alliance, human healthcare might step up to a whole new level.

4. Current progress

At present, industry has two main technical routes for data sharing that address privacy leakage and data abuse. One is trusted computing based on a hardware Trusted Execution Environment (TEE); the other is cryptography-based secure Multi-Party Computation (MPC).

4.1 TEE

TEE literally means Trusted Execution Environment. The core idea is to use third-party hardware as the carrier: data is shared inside a trusted execution environment created by the hardware. Representative technologies include Intel's SGX, AMD's SEV, and ARM's TrustZone. The general principle of a TEE scheme is that each party's data is decrypted and computed on only inside the hardware-protected environment.

Among TEE technologies usable in production today, essentially only Intel's SGX is relatively mature. Applications built on SGX are a hot direction in industry, with companies such as Microsoft and Google investing in it.

4.1.1 SGX

SGX (Software Guard Extensions) is a software protection scheme provided by Intel. Through a set of CPU instructions, SGX allows user code to create private memory regions with elevated protection, called enclaves, which the OS, VMM, BIOS, and SMM all cannot access on their own; data in an enclave is decrypted only by hardware on the CPU during computation. Intel also provides a remote attestation mechanism, by which a user can remotely verify that the code running in an enclave is as expected.
Intel SGX builds a trusted "enclave" in specific hardware (such as memory), confining the security boundary of data and applications to the enclave itself and the processor, and its operation does not depend on other software or hardware. This means data protection is independent of the operating system or hardware configuration: even if hardware drivers, virtual machines, or the operating system itself are attacked and compromised, data leakage can still be prevented more effectively.

4.2 MPC

MPC (secure Multi-Party Computation) has long been a hot topic in academia but had a weak presence in industry; previously only small startups such as Sharemind and Privitar explored the direction. When Google proposed the concept of MPC-based "Federated Learning" on personal devices, MPC became popular in industry almost overnight. The general idea of an MPC scheme is that multiple parties jointly compute a function over their private inputs without revealing those inputs to one another.

4.2.1 Garbled circuits

Garbled circuits are a method proposed in the 1980s by Turing Award winner Andrew Yao. The principle: inside a computer, any function is ultimately represented by circuits of adders, multipliers, shifters, selectors, and so on, and these circuits can in the end be composed of only AND and XOR gates. A gate is really just a truth table. Suppose we encrypt the gate's inputs and outputs with distinct keys and build an encrypted truth table: from the point of view of control flow the gate behaves exactly the same, but its input and output information is protected.
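A minimal sketch of this idea for a single AND gate. It uses SHA-256 as a one-time pad and a zero tag so the evaluator can recognize the correct row; real constructions add optimizations such as point-and-permute and free-XOR, so this is illustrative only:

```python
# Toy garbled AND gate: each wire value gets a random label (key), the
# truth table is encrypted row by row with a hash-based one-time pad,
# and the evaluator learns only the output label. Illustrative only.
import hashlib
import random
import secrets

def pad(k1, k2):
    return hashlib.sha256(k1 + k2).digest()      # 32-byte one-time pad

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

TAG = b"\x00" * 16          # redundancy so the evaluator can spot the right row

# one random 16-byte label per possible value of each wire
label_a = [secrets.token_bytes(16) for _ in (0, 1)]
label_b = [secrets.token_bytes(16) for _ in (0, 1)]
label_c = [secrets.token_bytes(16) for _ in (0, 1)]

# garble: encrypt label_c[i AND j] under the pair of input labels
table = [xor(pad(label_a[i], label_b[j]), label_c[i & j] + TAG)
         for i in (0, 1) for j in (0, 1)]
random.shuffle(table)                            # hide the row order

def evaluate(ka, kb):
    """Evaluator holds one label per input wire and tries every row."""
    for row in table:
        plain = xor(pad(ka, kb), row)
        if plain[16:] == TAG:
            return plain[:16]

out = evaluate(label_a[1], label_b[1])
print(out == label_c[1])    # True: 1 AND 1 = 1, and only the label leaks
```

The evaluator learns which output label it got, but without the garbler's label table it cannot tell whether that label means 0 or 1.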

4.2.2 Secret sharing

The basic principle of secret sharing is to split each number randomly into several shares distributed to the participants. Each participant then holds only a part of the original data; one or a few participants cannot reconstruct the original data, which can only be recovered when everyone puts their shares together.
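A minimal sketch of the additive variant of this idea (toy prime modulus, all parties simulated in one process):

```python
# Additive secret sharing as described above: a secret is split into
# random shares that individually reveal nothing; only the sum of all
# shares (mod P) reconstructs it. Adding shares pointwise also lets
# parties add secrets without revealing them.
import secrets

P = 2**61 - 1                      # a public prime modulus (toy choice)

def share(secret, n_parties):
    """Split `secret` into n random additive shares mod P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)    # correction share
    return shares

def reconstruct(shares):
    return sum(shares) % P

s1 = share(123, 3)
s2 = share(777, 3)
print(reconstruct(s1))                                   # 123
# each party adds its two local shares; nobody ever sees 123 or 777
sum_shares = [(a + b) % P for a, b in zip(s1, s2)]
print(reconstruct(sum_shares))                           # 900
```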

4.2.3 Homomorphic encryption

Homomorphic encryption is a special encryption method that allows computation on ciphertexts whose result is still encrypted: processing the ciphertext directly yields the same result as processing the plaintext and then encrypting the outcome. Homomorphism is a concept from abstract algebra; homomorphic encryption is one of its applications.
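A tiny demonstration of the homomorphic property, using textbook RSA, which is multiplicatively homomorphic. Note this is a toy with tiny keys and no padding; the schemes typically used in federated learning, such as Paillier, are additively homomorphic instead:

```python
# Textbook RSA is multiplicatively homomorphic:
#   E(a) * E(b) mod n  decrypts to  a * b mod n.
# Toy key sizes and no padding -- for illustrating homomorphism only.
p, q = 61, 53                      # toy primes
n = p * q                          # 3233
phi = (p - 1) * (q - 1)            # 3120
e, d = 17, 2753                    # public/private exponents, e*d % phi == 1

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

a, b = 7, 11
c = encrypt(a) * encrypt(b) % n    # multiply ciphertexts only
print(decrypt(c))                  # 77: a*b was computed "inside" the encryption
```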

4.3 Case study

4.3.1 Ant Financial's Shared Machine Learning

To respond better to the changing landscape and resolve the contradiction between the need to share data and the risks of privacy leakage and data abuse, Ant Financial proposed using technical means to ensure that while multiple parties learn from shared data, user privacy is never leaked and data usage remains controllable. They call this Shared Machine Learning.
Definition of shared machine learning: a learning paradigm that, in scenarios with multiple participants where the data providers and the platform do not trust each other, can aggregate information from multiple parties while protecting the participants' data privacy.
Ant Financial has been exploring and researching shared machine learning since 2017, combining the TEE and MPC routes and, given the characteristics of its own business scenarios, focusing on applications in the financial industry.

4.3.1.1 Features

Ant Financial's shared machine learning solution has the following features:

  • Integration of multiple secure computation engines, so the appropriate security technique can be chosen for each business scenario: there are centralized solutions based on TEE as well as distributed solutions based on MPC; both horizontally and vertically partitioned data are covered; and both model training and model prediction are supported.
  • Support for many machine-learning algorithms and data-preprocessing operators, including but not limited to LR, GBDT, XGBoost, DNN, CNN, RNN, and GNN.
  • Large-scale clustering, providing financial-grade efficiency, stability, and systematic support.

4.3.1.2 TEE-based shared learning

The bottom layer of Ant's shared learning uses Intel SGX and is compatible with other TEE implementations. At present, SGX-based shared learning supports clustered online model prediction and offline training.

4.3.1.2.1 Online model prediction

Prediction is usually an online service. Compared with offline training, online prediction is algorithmically simpler but has higher stability requirements.
One of the key techniques for improving the stability of online services is clustering, which addresses load balancing, failover, dynamic scaling, and other stability issues.
However, because of the peculiarities of SGX itself, traditional clustering schemes do not work on it.
Ant Financial therefore designed the following basic framework for distributed online services:


What distinguishes this framework from a traditional distributed framework is that each service registers with the Cluster Manager (CM) on startup and maintains a heartbeat. When the CM finds that several enclaves running identical code have registered, it notifies them to synchronize keys; on notification, the enclaves confirm each other's identity via remote attestation, and once they verify that their enclave signatures are identical, they negotiate and synchronize keys over a secure channel.
The framework has the following features:

1. The clustering scheme solves load balancing, failover, dynamic scaling, and cross-datacenter disaster recovery for online services;
2. Multi-cluster management plus an SDK heartbeat mechanism handles code upgrades, canary releases, and release rollback;
3. Built-in ServiceProvider technology together with the SDK lowers users' integration cost;
4. An easy-to-use development framework lets users write business logic without caring about the distributed layer at all;
5. A Provision proxy mechanism ensures that SGX machines need no outbound Internet connection, improving system security.

On top of this framework, common prediction algorithms including LR, GBDT, and XGBoost are already supported, covering prediction over encrypted and fused data from one or more parties. The framework also extends easily to other algorithms.

4.3.1.2.2 Offline model training

For the training stage, besides supporting LR and GBDT on a self-developed training framework, Ant Financial used the LibOS Occlum and a self-developed distributed networking system to successfully port native XGBoost into SGX, with support for multi-party data fusion and distributed training. This avoids a large amount of duplicated development work, and when the XGBoost community ships new features, they can be reused inside SGX directly without extra development. The team is currently using the same approach to port the TensorFlow framework.
In addition, to address SGX's much-criticized 128 MB memory limit (exceeding 128 MB triggers paging, which sharply degrades performance), they greatly reduced the impact of the limit on performance through algorithmic optimization and distribution.
The TEE-based training flow for multi-party shared learning is as follows:

1. An institutional user downloads the encryption tool from Data Lab.
2. The user encrypts its data with the tool; the tool embeds the remote-attestation (RA) flow, ensuring the encrypted information can only be decrypted inside the designated enclave.
3. The user uploads the encrypted data to cloud storage.
4. The user builds a training task on the Data Lab training platform.
5. The training platform dispatches the training task to the training engine.
6. The training engine starts the training enclaves, reads the encrypted data from cloud storage, and completes the specified training task.


With data sharing and machine learning done this way, participants are guaranteed that all uploaded data is encrypted, and the security of the encryption is assured through formal verification.

4.3.1.3 MPC-based shared learning

Ant's MPC-based shared learning framework has three layers:

  • Security technology layer: provides the basic security primitives, such as the secret sharing, homomorphic encryption, and garbled circuits mentioned above, plus other closely related techniques such as differential privacy and the DH algorithm;
  • Basic operator layer: on top of the security layer, basic operators are encapsulated, including secure multi-party set intersection, matrix addition, matrix multiplication, and multi-party computation of functions such as sigmoid and ReLU; the same operator may have several implementations to suit different scenarios while keeping a consistent interface;
  • Secure machine-learning algorithms: with the basic operators in place, secure machine-learning algorithms can be developed conveniently. The technical difficulty here lies in reusing existing algorithms and frameworks as much as possible; the team has made some useful attempts but also met big challenges.
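As a sketch of one such basic operator, here is secure multiplication of two secret-shared values using a Beaver triple, the standard building block behind secret-shared matrix multiplication. Both parties and the triple dealer are simulated in one process; the modulus and values are arbitrary:

```python
# Beaver-triple multiplication over additive shares. In reality each
# list index lives on a different machine, and the triple (a, b, c=ab)
# comes from a trusted dealer or an offline phase; here everything is
# simulated in one process for illustration.
import secrets

P = 2**61 - 1                      # public prime modulus (toy choice)

def share(v):
    """Split v into two additive shares mod P."""
    r = secrets.randbelow(P)
    return [r, (v - r) % P]

def open_(shares):
    """Both parties publish their shares to reveal the value."""
    return sum(shares) % P

def beaver_mul(xs, ys):
    """Return shares of x*y given shares xs, ys and a fresh triple."""
    a, b = secrets.randbelow(P), secrets.randbelow(P)
    a_sh, b_sh, c_sh = share(a), share(b), share(a * b % P)
    # each party publishes x_i - a_i and y_i - b_i; d and e are public
    d = open_([(x - ai) % P for x, ai in zip(xs, a_sh)])
    e = open_([(y - bi) % P for y, bi in zip(ys, b_sh)])
    # local combination; the public constant d*e is added by party 0 only:
    #   d*e + d*b + e*a + c = (x-a)(y-b) + (x-a)b + (y-b)a + ab = x*y
    return [(d * e * (i == 0) + d * b_sh[i] + e * a_sh[i] + c_sh[i]) % P
            for i in range(2)]

xs, ys = share(6), share(7)
print(open_(beaver_mul(xs, ys)))   # 42
```

Matrix multiplication in the operator layer is this same trick applied entry-wise (or with matrix-valued triples).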


The training engine is organized as follows.

The Coordinator is deployed on Ant's platform to control and coordinate tasks; it does not itself take part in the actual computation. Workers are deployed at the institutions participating in the secure multi-party computation and perform the actual interactive computation under the secure multi-party protocols.
The training task flow that the user builds on the modeling platform is sent to the Coordinator's Task Flow Manager, which decomposes the task and, via the Task Manager, dispatches the concrete algorithms to the Task Executor on the Worker side; the Task Executor invokes the secure operators on the Worker according to the algorithm graph to carry out the actual computation.
With this method, data sharing is achieved without data ever leaving its own domain, and the training tools can be deployed on local servers.

5. References

1. Federated learning white paper
https://img.fedai.org.cn/fedweb/1552917119598.pdf
2. Different from Google's "federated learning": Ant Financial's new solution to data silos, Shared Machine Learning
https://www.infoq.cn/article/R2aw6rPCrUvfZA0ivjHO
3. Breaking down silos with "hard" data-security technology: federated learning practice
https://www.intel.cn/content/www/cn/zh/analytics/artificial-intelligence/break-down-data-silos-with-hardware-enhanced-security.html
4. TEE-based shared learning: a data-silo solution
https://cloud.tencent.com/developer/article/1511840
5. Federated learning explained in detail
https://zhuanlan.zhihu.com/p/79284686

Recommended reading: (b) Federated Learning - FATE Standalone Deployment



Origin blog.csdn.net/qq_28540443/article/details/104416436