Ant Financial's shared intelligence in practice: how to make data sharing less difficult?

Introduction: Artificial intelligence today faces a "you cannot have both" problem: it is hard to balance privacy with usability. For an AI system to be effective, some privacy may have to be sacrificed. Yet in many real scenarios, failing to accommodate both privacy and usability leaves AI deployments stuck.

With the introduction of data security and privacy protection laws and the growing attention paid to them, the once widespread sharing of data is being challenged, and each data owner's data is returning to an island state. At the same time, Internet companies find it harder to collect and use users' private data, and data silos have become the norm. To make better use of data, it must be shared among organizations, companies, and users while still meeting privacy protection and data security requirements.

To solve this problem, many technology companies at home and abroad have introduced solutions, such as Google's federated learning and Ant Financial's shared intelligence. For this article, InfoQ interviewed Zhou Jun, head of the machine learning algorithm middle platform at Ant Financial, to learn how shared intelligence addresses data sharing in the financial sector.

The difference between shared intelligence and federated learning

Before introducing the technical practice, it is worth clarifying the difference between shared intelligence and federated learning, so that readers understand the scope of this article.

Currently, the industry follows two main technical routes to address privacy leakage and data misuse in data sharing. One is hardware-based trusted computing, the Trusted Execution Environment (TEE). The other is cryptography-based secure multi-party computation (MPC: Multi-party Computation).

TEE literally means trusted execution environment. The core idea is to rely on third-party hardware: data is shared inside a trusted execution environment created by that hardware. Among the TEE technologies usable in production today, Intel's SGX is essentially the only relatively mature one. Applications built on SGX are a popular direction in the industry, and Microsoft, Google, and other companies have invested in it.

 

MPC (Multi-party Computation, secure multi-party computation) has long been a hot topic in academia but had little presence in industry. Previously only a few startups, such as Sharemind and Privitar, had explored this direction. That changed when Google proposed "federated learning" (Federated Learning), an MPC-based concept aimed at individual end devices, which made MPC popular in industry almost overnight.

 

Currently, for data sharing scenarios, the industry has built a number of solutions on the technical routes above, including privacy-preserving machine learning (PPML), federated learning, coopetitive learning, and trusted machine learning. Solutions along the different routes overlap to some extent. Zhou Jun said that the shared intelligence proposed by Ant Financial (also known as shared machine learning) combines the TEE and MPC routes and, given the characteristics of Ant's own business scenarios, focuses on applications in the financial industry.

In simple terms, the idea of shared intelligence is that in scenarios with multiple participants, where the data providers and the platform do not trust one another, information from multiple parties can still be aggregated for analysis and machine learning, while ensuring that no participant's privacy is leaked and the information is not misused.

On the difference between shared intelligence and federated learning, Zhou Jun said that the term federated learning currently covers two different concepts:

  • The first is the federated learning proposed by Google, which aims to keep on-device private data unexposed during cloud + device training. This is a To C scenario with horizontally partitioned data. Besides protecting the privacy of on-device data, it also focuses on handling problems such as devices dropping offline during training.
  • The second is the federated learning proposed in China, which mainly addresses keeping each party's data private in To B scenarios. It applies to both horizontally and vertically partitioned data. The two focus on different data sharing scenarios and have different technical emphases.

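In 2019, a survey of federated learning written by a number of well-known universities and companies, "Advances and Open Problems in Federated Learning", gave a fairly clear definition and description of federated learning. First, the architecture consists of one central server and multiple compute nodes. The central server takes part in the whole computation, so the architecture does not suit applications that do not need a central server node (the paper calls that mode Fully Decentralized Distributed Learning). In addition, federated learning requires that raw data not leave its domain, which also limits the technical options it can use. Shared intelligence, by contrast, starts from the problem. In an emerging field that faces many kinds of complex scenarios, it is hard to solve every problem with a single technical scheme. Shared intelligence solutions therefore include not only a mode in which a central server takes part in the computation, similar to federated learning, but also fully decentralized schemes and TEE-based shared learning schemes.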
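To make the "one central server plus multiple compute nodes" architecture concrete, here is a minimal sketch in plain Python. It assumes a simple weighted averaging of model weights, in the style of federated averaging; the article does not say which aggregation rule any particular system uses, and the function names (`local_update`, `aggregate`, `run_round`) and the toy linear model are illustrative assumptions only.

```python
from typing import List, Tuple

Vector = List[float]
Sample = Tuple[Vector, float]  # (features, label)


def local_update(weights: Vector, data: List[Sample], lr: float = 0.1) -> Vector:
    """Run one pass of SGD for a linear model on a node's private data."""
    w = list(weights)
    for x, y in data:
        error = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    return w


def aggregate(updates: List[Vector], sizes: List[int]) -> Vector:
    """Central server: average node weights, weighted by local sample count."""
    total = sum(sizes)
    return [
        sum(u[i] * n for u, n in zip(updates, sizes)) / total
        for i in range(len(updates[0]))
    ]


def run_round(global_weights: Vector, node_datasets: List[List[Sample]]) -> Vector:
    """One round: nodes train locally, only weights travel to the server."""
    updates = [local_update(global_weights, data) for data in node_datasets]
    return aggregate(updates, [len(data) for data in node_datasets])


# Two nodes whose private data both follow y = 2 * x (toy data, assumed).
nodes = [[([1.0], 2.0)], [([2.0], 4.0)]]
weights = [0.0]
for _ in range(50):
    weights = run_round(weights, nodes)
```

In this sketch the raw samples never leave `local_update`, which mirrors the "raw data does not leave its domain" requirement, while the server in `aggregate` takes part in every round. A fully decentralized scheme would have to replace that central step with peer-to-peer exchange.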

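Different schemes have their own strengths and weaknesses in different scenarios. Zhou Jun said that machine learning under data sharing still has many breakthroughs ahead. The team is not fixated on whether the problem is solved by federated learning, decentralized distributed learning, or any other technical scheme. In the end, the hope is that everyone can work together to solve this industry-wide problem.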

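Ant Financial's shared intelligence in practice

Ant Financial began investing in shared intelligence research in 2016. The starting point was to solve problems encountered in its business, such as coordinating information between partner institutions and Ant Financial. On that basis, Ant Financial investigated a variety of approaches, including differential privacy and matrix transformation, and settled on its current overall technical direction.

Looking at the whole R&D effort, Zhou Jun thinks it can be roughly divided into three periods: exploration, technical breakthrough, and technical application.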

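  • Exploration: surveyed the relevant technologies across the industry, designed more than a hundred schemes, verified their feasibility one by one, and refined the technology repeatedly in real scenarios, achieving the breakthrough from 0 to 1;
  • Technical breakthrough: after the earlier exploration, identified several schemes that might suit industrial use, then optimized their security and performance one by one in large-scale industrial scenarios;
  • Technical application: began large-scale use in real business scenarios, facing business needs directly, further tempering the technology, and accepting the test of the market.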

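As for the technical details of shared intelligence, Zhou Jun said they can be understood along the two routes, TEE and MPC.

TEE-based shared learning

Ant's shared learning uses Intel's SGX technology at the bottom layer and is compatible with other TEE implementations. The following focuses on one TEE-based shared learning scheme, in which data leaves its domain in encrypted form. This scheme currently supports clustered online model prediction and offline training.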

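1. Online model prediction

Prediction is usually an online service. Compared with offline training, online prediction is relatively simple in algorithmic complexity but demands higher stability. One of the key techniques for improving the stability of an online service is clustering, which addresses stability concerns such as load balancing, failover, and dynamic scaling.

However, because of the particular nature of SGX, traditional clustering schemes do not work on SGX.

For this reason, Ant Financial designed the following basic framework for distributed online services: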

 

[Figure: basic framework of the distributed online service]

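This framework differs from a traditional distributed framework in that each service registers with the cluster management center (ClusterManager, CM for short) at startup and maintains a heartbeat. When the CM finds that multiple Enclaves with identical code have registered, it notifies them to synchronize keys. On receiving the notification, the Enclaves confirm each other's identity through remote attestation. Once they confirm that their Enclave signatures are exactly the same, they negotiate and synchronize keys over a secure channel.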
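The registration and notification logic described above can be sketched roughly as follows. This is only an illustration of the bookkeeping on the CM side: the class and method names are hypothetical, the `signature` string merely stands in for an Enclave's code signature, and remote attestation and key negotiation (which happen between the Enclaves themselves) are not modelled at all.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EnclaveRecord:
    service_id: str
    signature: str         # stand-in for the Enclave's code signature
    last_heartbeat: float  # timestamp of the latest heartbeat


class ClusterManager:
    """Hypothetical CM: tracks registrations and finds Enclaves to sync."""

    def __init__(self, heartbeat_timeout: float = 10.0) -> None:
        self.heartbeat_timeout = heartbeat_timeout
        self.records: Dict[str, EnclaveRecord] = {}

    def register(self, service_id: str, signature: str,
                 now: Optional[float] = None) -> List[str]:
        """Register a service and return the peers to notify for key sync."""
        now = time.time() if now is None else now
        self.records[service_id] = EnclaveRecord(service_id, signature, now)
        return self.sync_group(signature, now)

    def heartbeat(self, service_id: str, now: Optional[float] = None) -> None:
        """Refresh the liveness timestamp of a registered service."""
        now = time.time() if now is None else now
        self.records[service_id].last_heartbeat = now

    def sync_group(self, signature: str, now: float) -> List[str]:
        """Live services sharing a signature; empty unless there are 2+."""
        alive = sorted(
            r.service_id
            for r in self.records.values()
            if r.signature == signature
            and now - r.last_heartbeat <= self.heartbeat_timeout
        )
        return alive if len(alive) > 1 else []
```

For example, registering two services with the same signature returns both IDs on the second call, which is the point where the CM would notify them. In the real system the Enclaves would then verify each other through remote attestation before exchanging any key material.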

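2. Offline model training

In the model training stage, besides supporting LR and GBDT training on its self-developed training framework, Ant Financial also used the LibOS Occlum (developed under Ant's lead and open source) together with a self-developed distributed networking system to port native Xgboost into SGX, with support for multi-party data fusion and distributed training. This approach avoids a great deal of duplicated development work, and when the Xgboost community ships new features they can be reused directly inside SGX without extra development. Ant Financial is currently using the same approach to migrate the TensorFlow framework.

In addition, to address the much-criticized 128 MB memory limit of SGX (exceeding 128 MB triggers paging, which degrades performance sharply), Ant Financial has greatly reduced the impact of the memory limit on performance through algorithm optimization and distributed execution.

With the scheme above, the multi-party data shared learning training flow is as follows: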

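  1. Institutional users download the encryption tool from Data Lab.
  2. They encrypt their data with the tool. The tool embeds the RA (remote attestation) flow, which ensures the encrypted information can only be decrypted inside the designated Enclave.
  3. Users upload the encrypted data to cloud storage.
  4. Users build a training task on Data Lab's training platform.
  5. The training platform dispatches the training task to the training engine.
  6. The training engine starts the training-related Enclave, which reads the encrypted data from cloud storage and completes the specified training task.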

 

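In addition, for scenarios where some data providers do not want data to leave their domain, Ant also offers a scheme that uses TEE to encrypt the parameter information exchanged during training. For reasons of space, it is not covered here.

MPC-based shared learning

Ant's MPC-based shared learning framework has three layers: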

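  • Security technology layer: provides the basic security technology implementations, such as secret sharing, homomorphic encryption, and garbled circuits, along with other closely related techniques such as differential privacy and the DH algorithm;
  • Basic operator layer: on top of the security technology layer, Ant Financial wraps a set of basic operators, including multi-party secure data intersection, matrix addition, matrix multiplication, and multi-party computation of the sigmoid and ReLU functions. The same operator may have several implementations to suit different scenarios while keeping a consistent interface;
  • Secure machine learning algorithms: with the basic operators in place, secure machine learning algorithms can be developed conveniently. The technical difficulty here is how to reuse existing algorithms and frameworks as much as possible. Ant Financial has made some useful attempts in this area but has also met significant challenges.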
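As an illustration of how a basic operator can sit on top of a security primitive, the following sketch builds a matrix addition operator on additive secret sharing. It is a textbook construction, not Ant Financial's implementation; the modulus, the function names, and the restriction to non-negative integers are all assumptions made for the example.

```python
import secrets
from typing import List

MODULUS = 2 ** 61 - 1  # assumed ring size for the shares
Matrix = List[List[int]]


def share(value: int, n_parties: int) -> List[int]:
    """Split a value into n random shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def share_matrix(matrix: Matrix, n_parties: int) -> List[Matrix]:
    """Secret-share every entry; party p receives the p-th returned matrix."""
    parts = [[[0] * len(row) for row in matrix] for _ in range(n_parties)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            for p, s in enumerate(share(value, n_parties)):
                parts[p][i][j] = s
    return parts


def add_local(x: Matrix, y: Matrix) -> Matrix:
    """Matrix addition operator: each party adds its own shares locally."""
    return [[(a + b) % MODULUS for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def reconstruct(parts: List[Matrix]) -> Matrix:
    """Combine all parties' share matrices to reveal the result."""
    rows, cols = len(parts[0]), len(parts[0][0])
    return [
        [sum(p[i][j] for p in parts) % MODULUS for j in range(cols)]
        for i in range(rows)
    ]


# Two input matrices, each split among three parties (toy values).
a_parts = share_matrix([[1, 2], [3, 4]], 3)
b_parts = share_matrix([[10, 20], [30, 40]], 3)
sum_parts = [add_local(a, b) for a, b in zip(a_parts, b_parts)]
result = reconstruct(sum_parts)
```

Addition needs no communication between parties, which is why it is the easiest operator to build this way. Multiplication and nonlinear functions such as sigmoid need interactive protocols or approximations, which is one reason an operator may have several implementations behind the same interface.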

 

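At present, this MPC-based shared learning framework supports mainstream algorithms including LR, GBDT, and DNN. Going forward, more algorithms will be added according to business needs, and more implementation options will be provided for the various operators to suit different business scenarios.

The MPC-based multi-party data shared learning training flow is as follows: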

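  1. Institutional users download the training service from Data Lab and deploy it locally.
  2. Users build a training task on Data Lab's training platform.
  3. The training platform dispatches the training task to the training engine.
  4. The training engine dispatches the task to the Worker, the training server on the institution side.
  5. The Worker loads local data.
  6. The Workers interact through secure multi-party protocols according to the dispatched training task and complete it.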

 

The architecture of the training engine is as follows:

 

[Figure: architecture of the training engine]

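The Coordinator is deployed on the Ant platform and handles task control and coordination. It does not itself take part in the actual computation. Workers are deployed at the institutions participating in secure multi-party computation and carry out the actual interactive computation based on secure multi-party protocols.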

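The training task flow that a user builds on the modeling platform is sent to the Coordinator's Task Flow Manager. The Task Flow Manager breaks the task down and, through the Task Manager, dispatches the specific algorithms to the Task Executor on the Worker side. The Task Executor then calls the secure operators on the Worker according to the algorithm graph to complete the actual computation.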
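The dispatch path from Task Flow Manager to Task Executor might be sketched like this. The component names come from the article, but every method, the list-of-steps representation of an algorithm graph, and the plain arithmetic operators (which stand in for secure operators) are assumptions made purely for illustration.

```python
from typing import Callable, Dict, List, Tuple

# An algorithm graph is modelled as an ordered list of (operator, argument).
AlgorithmGraph = List[Tuple[str, int]]
Operator = Callable[[int, int], int]


class TaskExecutor:
    """Worker side: runs algorithm graphs against data that stays local."""

    def __init__(self, operators: Dict[str, Operator], local_value: int) -> None:
        self.operators = operators
        self.state = local_value  # never sent back to the Coordinator

    def execute(self, graph: AlgorithmGraph) -> str:
        for op_name, arg in graph:
            self.state = self.operators[op_name](self.state, arg)
        return "done"  # only a status is reported


class TaskManager:
    """Coordinator side: sends one algorithm to every Worker's executor."""

    def __init__(self, executors: Dict[str, TaskExecutor]) -> None:
        self.executors = executors

    def dispatch(self, graph: AlgorithmGraph) -> Dict[str, str]:
        return {name: ex.execute(graph) for name, ex in self.executors.items()}


class TaskFlowManager:
    """Coordinator side: splits a task flow into tasks and runs them in order."""

    def __init__(self, task_manager: TaskManager) -> None:
        self.task_manager = task_manager

    def run(self, flow: List[Tuple[str, AlgorithmGraph]]) -> Dict[str, Dict[str, str]]:
        return {name: self.task_manager.dispatch(graph) for name, graph in flow}


# Stand-in operators; a real Worker would expose secure MPC operators here.
ops: Dict[str, Operator] = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
workers = {"A": TaskExecutor(ops, 2), "B": TaskExecutor(ops, 5)}
flow_manager = TaskFlowManager(TaskManager(workers))
statuses = flow_manager.run([("scale", [("mul", 3)]), ("shift", [("add", 1)])])
```

The point of the design is visible in what flows back: the Coordinator sees only task statuses, while intermediate values remain in each Worker's `state`, matching the statement that the Coordinator does not take part in the actual computation.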

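With this approach, data sharing can be accomplished without the data leaving its domain, and the training tools can be deployed on local servers.

Significance for the financial sector

Whether it is federated learning or shared intelligence, many technical efforts have chosen finance as the first field to deploy in. Compared with other sectors, finance controls data more strictly and takes data privacy more seriously, so it is also the sector that most needs technical means to solve the data silo problem.

Zhou Jun said that in finance, shared intelligence focuses on problems in the broad area of "openness", such as joint marketing and joint risk control, two scenarios where concrete results are relatively easy to see. Finance places more weight on data protection than other sectors, and data is harder to move around within it. Shared intelligence offers better privacy protection and makes data "usable but not visible", which makes it a key enabler.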

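For example, through data fusion, Ant Financial's shared intelligence helped Zhonghe Nongxin (中和农信) substantially improve its risk control performance, turning the traditional offline process into an online automated approval process. Granting credit takes only 5 minutes. Over 8 months, cumulative lending reached 3.19 billion, 440,000 people were granted credit, and the business covered more than 20 provinces and regions, more than 300 counties, and more than 10,000 villages.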

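Adoption is hard for enterprises: how can this be solved?

Although deploying this technology matters a great deal to financial firms, many companies run into problems when they actually try, whether for technical reasons or out of concern about the results.

In the interview, Zhou Jun said that shared intelligence is a cross-disciplinary field involving cryptography, machine learning, and other technologies, and it has a certain barrier to entry. Enterprises deploying it need to weigh their own technical capabilities against their business needs. Ant Financial is also actively exploring technologies and solutions that lower this barrier, and as more enterprises join in, he believes that in the near future deploying shared intelligence will no longer carry a high barrier.

In addition, Ant Financial's shared intelligence is an open ecosystem. Ant hopes more enterprises will join and build it together, without having to repeat the many detours Ant Financial has already taken. Financial firms can follow the latest industry developments as their business requires and choose the most suitable technologies and partners to solve their business problems. What matters most is helping the business succeed and resolving its pain points.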

More importantly, shared intelligence is meant to solve a trust problem, so large-scale adoption presupposes that users thoroughly understand and trust it. Ant Financial is building user confidence step by step by establishing benchmark cases, promoting standards, and selectively open-sourcing technology. It has already deployed benchmark business scenarios with a number of institutions in areas such as intelligent credit. It is also leading work on shared intelligence industry standards, alliance standards, and national standards, as well as international standards at IEEE, ITU-T, and other bodies. Zhou Jun said he believes that as the technology and user awareness develop together, large-scale adoption of shared intelligence will come soon, and the first to benefit will be data-driven industries with strong privacy protection needs, such as financial technology and medical technology.

Conclusion

Looking ahead, Zhou Jun said the focus is to keep pushing the whole industry to solve the data sharing problem together. Ant Financial will gradually open up its technical capabilities to empower enterprises in the industry that need them, and will join forces with more organizations, including research institutions and enterprises, to tackle the technical challenges. Ultimately, the hope is to work with the whole industry to build an interconnected data sharing intelligence network that protects user privacy and prevents data misuse, and thereby better achieve inclusive finance.

Origin blog.csdn.net/alitech2017/article/details/104694749