Getting started with federated learning quickly: Tencent's self-developed federated learning platform PowerFL

Introduction: Over the past decade, machine learning has advanced rapidly within artificial intelligence, fueled in large part by the massive amounts of data accumulated by human society. Yet despite the overall growth in data volume, the vast majority of data remains scattered across different companies and departments, leaving it severely isolated and fragmented. Precisely for this reason, organizations have a strong desire to cooperate on data; at the same time, data privacy and security considerations make compliant data cooperation highly challenging.

The resulting data silos seriously hinder the collaborative construction of artificial intelligence models across parties, so a new mechanism is urgently needed. Federated learning emerged to meet this need: under the premise of ensuring user privacy and data security, the exchange of model information between organizations is carefully designed and encrypted end to end, so that no organization can infer another's private data, while the goal of joint modeling is still achieved.

PowerFL is a federated learning platform developed by Tencent TEG. It has been deployed in business scenarios such as financial cloud and joint advertising modeling, with promising initial results. Through technology empowerment, PowerFL builds a bridge for data communication between different departments and teams, making data collaboration possible while protecting data privacy. This article introduces the overall technical architecture of PowerFL from three aspects: the platform framework, the deployment view, and the network topology.

  • Introduction to the Platform Framework of PowerFL

  • Deployment view and key components of PowerFL

  • Network topology of PowerFL

  • Rapid deployment of PowerFL

    • Prepare k8s cluster

    • Prepare Yarn cluster (optional)

    • Prepare the installation client

    • One-click deployment of PowerFL

  • Submit federated tasks via flow-server

    • The federated task orchestration and scheduling process

  • Summary

Introduction to the Platform Framework of PowerFL

From the perspective of the platform framework, PowerFL builds the technology and ecosystem of federated learning at the following five levels, from bottom to top:

  1. Computing and data resources: PowerFL supports the two mainstream resource scheduling engines, YARN and K8S. All service components are deployed on the K8S cluster as containers, which greatly simplifies deployment, operations, and maintenance, and makes fault tolerance and service scaling straightforward. All computing components are scheduled through the YARN cluster, ensuring the stability and fault tolerance of computation while providing parallel acceleration for large-scale machine learning tasks. In addition, PowerFL can pull data from a variety of data sources, including TDW, Ceph, COS, HDFS, and others.
  2. Computing framework: On top of these computing and data resources, PowerFL implements a computing framework for federated learning algorithms. Compared with traditional machine learning frameworks, it focuses on the most common difficulties that federated learning algorithms and applications face in practice: 1) Secure encryption: PowerFL implements a variety of common homomorphic, symmetric, and asymmetric encryption algorithms (including Paillier, RSA, and others); 2) Distributed computing: built on Spark on Angel, a high-performance distributed machine learning framework, PowerFL makes it easy to implement efficient distributed federated learning algorithms; 3) Cross-network communication: PowerFL provides a set of multi-party cross-network transmission interfaces backed by message queue components, achieving stable, reliable, high-performance cross-network transmission while keeping data secure; 4) TEE/SGX support: beyond software-level security guarantees, PowerFL can also encrypt and compute data inside an Enclave via TEE/SGX, greatly improving algorithm performance at the hardware level.
  3. Algorithm protocols: Based on the above computing framework, PowerFL implements common federated algorithm protocols for different scenarios: 1) for analysis scenarios, PowerFL supports joint queries and two-party/multi-party sample alignment; 2) for modeling scenarios, PowerFL supports federated feature engineering (including feature selection, feature filtering, and feature transformation), federated training (including logistic regression, GBDT, DNN, etc.), and joint prediction.
  4. Product interaction: From the end user's perspective, PowerFL not only supports invoking federated tasks through a REST API, but also lets the modeling participants collaborate in a joint workspace, building and configuring federated task flows by dragging and dropping algorithm components and managing users, resources, configurations, and tasks.
  5. Application scenarios: With the above federated learning infrastructure in place, PowerFL can, in a secure and compliant way, solve the "data silo" problems caused by data isolation and fragmentation in application scenarios such as financial risk control, advertising recommendation, audience profiling, and joint queries, truly empowering privacy-compliant AI and big data applications.
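To make the additive homomorphic encryption mentioned in level 2 concrete, here is a minimal, self-contained sketch of textbook Paillier in pure Python. This is purely illustrative (toy key sizes, hypothetical function names), not PowerFL's actual implementation; production systems use primes of 1024 bits or more and hardened libraries:

```python
import math
import random

def keygen(p=104729, q=1299709):
    """Toy Paillier keypair from two small known primes (illustration only)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)       # Carmichael function of n
    mu = pow(lam, -1, n)               # valid because we fix g = n + 1
    return (n,), (lam, mu, n)

def encrypt(pub, m):
    (n,) = pub
    n2 = n * n
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:         # r must be invertible mod n
        r = random.randrange(2, n)
    # c = (1 + n)^m * r^n mod n^2
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def decrypt(priv, c):
    lam, mu, n = priv
    l = (pow(c, lam, n * n) - 1) // n  # L(x) = (x - 1) / n
    return l * mu % n

pub, priv = keygen()
c1, c2 = encrypt(pub, 17), encrypt(pub, 25)
# Additive homomorphism: multiplying ciphertexts adds the plaintexts,
# so a party can aggregate encrypted values without seeing them.
c_sum = c1 * c2 % (pub[0] ** 2)
assert decrypt(priv, c_sum) == 42
```

This property is what lets one party aggregate another party's encrypted gradients or statistics without ever decrypting the individual values.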

Deployment view and key components of PowerFL

From the deployment perspective, PowerFL comprises a service layer and a computing layer:

  • The service layer is built on the K8S cluster, leveraging its excellent resource scheduling, mature scaling mechanisms, and stable fault tolerance to run PowerFL's resident services as containers on service nodes. These resident service components include:
    • Message middleware: handles event-driven interaction between all services and computing components, algorithm synchronization between the parties' computing components, and asynchronous communication of encrypted data.
    • Task flow engine: controls the scheduling of the local (single-party) federated task flow. Following the predefined task flow order, it launches execution nodes as containers; a computing task either runs inside the execution node or is submitted from it to the YARN cluster.
    • Task panel: collects the key performance indicators of each iteration of each algorithm component in the task flow, as well as the output of the final model, such as AUC, accuracy, KS, and feature importance.
    • Multi-party federated scheduling engine: schedules and synchronizes federated tasks across the parties, and exposes a set of APIs for creating federated task flows and for starting, terminating, suspending, deleting, and querying the status of tasks.
  • The computing layer is built on the YARN cluster, makes full use of the Spark big data ecosystem, and is responsible for the distributed computation of each algorithm component at runtime. A computing task is actually initiated by a task node in the service layer, which applies to the YARN cluster for resources to run PowerFL's federated operators. The Spark on Angel computing framework ensures high parallelism and excellent algorithm performance.

Network topology of PowerFL

From the perspective of intranet users, PowerFL exposes service paths through the k8s ingress:

  • Visiting http://domain/ opens the task panel, which shows the running status of the current task, key logs produced during the run, and key performance indicators;
  • Visiting http://domain/pipelines opens the task flow engine interface, where you can see which stage a task is running on this side;
  • Visiting http://domain/flow-server reaches the REST API of flow-server.
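A hedged sketch of what a corresponding Kubernetes ingress could look like. The resource name, backing service names, and ports below are illustrative assumptions, not PowerFL's actual manifest; only the paths and the namespace/domain patterns come from this article:

```yaml
# Illustrative only: routes the three paths above to hypothetical backing services.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: powerfl-ingress          # hypothetical name
  namespace: power-fl-10000      # namespace pattern taken from the install steps
spec:
  rules:
    - host: powerfl-10000.com    # the DOMAIN configured at install time
      http:
        paths:
          - path: /pipelines     # task flow engine
            pathType: Prefix
            backend:
              service: { name: pipeline-engine, port: { number: 80 } }  # hypothetical
          - path: /flow-server   # flow-server REST API
            pathType: Prefix
            backend:
              service: { name: flow-server, port: { number: 80 } }      # hypothetical
          - path: /              # task panel (catch-all, so it is listed last)
            pathType: Prefix
            backend:
              service: { name: task-panel, port: { number: 80 } }       # hypothetical
```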

Communication between computing tasks (such as the drivers and executors of Spark jobs) and service layer components goes through the message middleware;

The encrypted intermediate results produced while a computing task executes, together with the task status information that needs to be synchronized, are synchronized across the external network through each party's message middleware.
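Conceptually, that cross-network exchange can be sketched as follows. This is a minimal in-process simulation: Python queues stand in for each party's real message middleware, and the topic name and payload shape are assumptions made up for the example:

```python
import json
import queue

# Each party's message middleware, simulated as an in-process queue.
mq_party_a = queue.Queue()
mq_party_b = queue.Queue()

def publish(mq, topic, payload):
    """A compute task publishes an (already encrypted) intermediate to its local MQ."""
    mq.put(json.dumps({"topic": topic, "payload": payload}))

def bridge(src, dst):
    """Cross-network sync: forward everything from one party's MQ to the other's."""
    while not src.empty():
        dst.put(src.get())

# Party A's executor publishes an encrypted partial result for iteration 3.
publish(mq_party_a, "task-42/iteration-3", {"enc_gradient": "0x1a2b..."})
bridge(mq_party_a, mq_party_b)   # stands in for the external-network sync

msg = json.loads(mq_party_b.get())
assert msg["topic"] == "task-42/iteration-3"
```

The key design point is that compute tasks never talk to the remote party directly; only the middleware crosses the network boundary, which keeps the transport concerns (retries, ordering, security) in one place.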

Rapid deployment of PowerFL

Having covered PowerFL's platform framework, deployment view, and network topology, we now turn to how to deploy it quickly. As mentioned above, PowerFL is divided into service layer components and a computing layer, built on a k8s cluster and a YARN cluster respectively. Before deploying PowerFL, you need to prepare these two cluster environments (if your computing tasks do not require a distributed environment, you can skip the YARN cluster).

Get the latest version of the installation package and unzip it before doing the following.

Prepare k8s cluster

Machine configuration requirements:

  • Number of machines: 1+
  • Hardware configuration: 16 GB+ memory, 4+ CPU cores, 100 GB+ disk
  • OS and version: CentOS 7.0 is recommended
  • Docker 1.8+

In a test environment, you can install Minikube by following its documentation. The VM driver can be kvm2 (Linux), or hyperkit or VirtualBox (macOS). Create a k8s cluster through Minikube:

minikube start --memory=8192 --cpus=4

If you are installing k8s in a production environment, refer to the official k8s documentation for deploying a production cluster. The options include:

  • Use kubeadm to install.
  • Use kops to install.
  • Use KRIB for installation.
  • Use Kubespray to install.
  • You can also use Tencent Cloud's TKE.

If you need to install k8s offline, you can refer to the documentation in offline-k8s-deploy in the installation package directory for installation.

Prepare Yarn cluster (optional)

You can refer to the Apache Ambari documentation to install the YARN cluster. After installation, collect the Hadoop configuration files and place them in the hadoop-config directory:

core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml

Then import these configuration files into the k8s cluster; here the imported configuration is named hadoop:

kubectl -n power-fl-[partyId] create configmap hadoop --from-file=./hadoop-config

Preparing to install the client

  1. Install jq and envsubst (envsubst ships with the gettext packages):
    # On Ubuntu
    sudo apt-get install jq gettext-base
    
    # On CentOS
    sudo yum install jq gettext
    
  2. To install Helm 3.0+, refer to Helm's official documentation. In short:
    • On macOS, install with Homebrew:
      brew install helm
      
    • On Windows, install with Chocolatey:
      choco install kubernetes-helm
      
    • Or install from the command line:
      curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash
      

One-click deployment of PowerFL

The following operations are performed on the client machine, from the root of the FL-deploy directory:

cd FL-deploy

Install for the first time

  1. Prepare the configuration file for kubectl:

    mkdir kube
    cp ~/.kube/config ./kube
    
  2. Copy the environment configuration template file and set the corresponding environment variables:

    cp _powerfl_env.sh ./powerfl_env.sh
    vim ./powerfl_env.sh
    
    #!/bin/bash
    
    # Set this party's id; each party's id must be unique
    export PARTY_ID=10000
    # Domain through which this party accesses PowerFL services
    export DOMAIN=powerfl-10000.com
    
    # Intranet address of the MQ, used by the internal Hadoop cluster
    export INTERNAL_MQ_HOST=xx.xx.xx.xx
    # Intranet TCP port of the MQ
    export INTERNAL_MQ_TCP_PORT="xxxx"
    
    # HTTP port of the MQ proxy exposed to the external network
    # (with the default k8s configuration, the port range is 30000-32767)
    export EXPOSE_MQ_HTTP_PORT="xxxx"
    # TCP port of the MQ exposed to the external network
    # (with the default k8s configuration, the port range is 30000-32767)
    export EXPOSE_MQ_TCP_PORT="xxxx"
    
    # Components that should not be installed, separated by spaces
    export DISABLED_COMPONENTS=""
    
    # Ids of the other parties, separated by spaces
    export OTHER_PARTIES="20000"
    # MQ configuration for each of the other parties; e.g. for OTHER_PARTIES="20000 30000"
    # you must configure both
    #     PARTY_MQ_HTTP_URL_20000 and PARTY_MQ_PROXY_URL_20000
    # and PARTY_MQ_HTTP_URL_30000 and PARTY_MQ_PROXY_URL_30000
    export PARTY_MQ_HTTP_URL_20000=yy.yy.yy.yy:yyyy
    export PARTY_MQ_PROXY_URL_20000=yy.yy.yy.yy:yyyy
    
  3. Run the install script:

    ./deploy.sh setup
    

Install multiple parties at the same time

If you need to install multiple parties on the same k8s cluster, copy _powerfl_env.sh into separate environment configuration files for the different parties, and specify the file when running the deployment script (if none is specified, powerfl_env.sh in the current directory is used by default, as shown above):

cp _powerfl_env.sh ./powerfl-10000.sh
vim ./powerfl-10000.sh # modify PARTY_ID and other settings

cp _powerfl_env.sh ./powerfl-20000.sh
vim ./powerfl-20000.sh # modify PARTY_ID and other settings

# Deploy party 10000 with the config file powerfl-10000.sh
./deploy.sh setup ./powerfl-10000.sh

# Deploy party 20000 with the config file powerfl-20000.sh
./deploy.sh setup ./powerfl-20000.sh

Update the system

If you need to modify the system configuration, edit the corresponding environment and component configuration files, then run:

./deploy.sh upgrade

# To specify a config file: ./deploy.sh upgrade ./powerfl-10000.sh

Uninstall the system

Note: this will delete all PowerFL data, which cannot be recovered:

./deploy.sh cleanup

# To specify a config file: ./deploy.sh cleanup ./powerfl-10000.sh

Submit federated tasks via flow-server

After installing PowerFL, you can write task flow and task parameter configuration files in the specified DSL and submit federated tasks to flow-server. Before describing the usage in detail, let's first walk through PowerFL's federated task orchestration and scheduling process.

The federated task orchestration and scheduling process

1) Write the task flow file pipeline.yaml and import the pipeline to flow-server:

curl --request POST 'http://{domain}/flow-server/pipelines' --form '[email protected]'

On success, it returns the pipeline's id. For the newly imported pipeline, write the task parameter configuration file (the DSL is introduced later) and submit the task to flow-server:

curl --request POST 'http://{domain}/flow-server/pipelines/{pipeline_id}/jobs' --form 'parameters=@job_parameters.yaml'

2) When flow-server receives the task submission, it randomly generates an id for the task, injects the relevant global configuration to build the final DSL for the task flow engine, and submits the task flow to it.

3) Following the DSL file, the task flow engine applies to the K8S cluster for resources in the order of the nodes defined by the task flow, and launches the runtime container of each corresponding node;

4 and 5) Once the runtime container starts, it applies for resources from the YARN cluster according to the injected environment variables, starts the driver and executors, launches the specific algorithm process, and executes the parallel computing task. At this point, the algorithm task on this side is running.

6) Meanwhile, after flow-server receives the task submission, it also passes the task's configuration file to the local message middleware;

7) The local message middleware synchronizes the task configuration files to the other participants' message middleware across the external network;

8) The other participants' flow-servers listen on the task-submission topic of their message middleware and receive the start request for the new federated task;

9) From there, the process is the same as steps 3), 4), and 5).
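The nine steps above can be condensed into a short sketch. These are pure-Python stand-ins under stated assumptions: a queue replaces the message middleware, `run_locally` is a placeholder for the real K8S/YARN scheduling of steps 3-5, and all names are made up for illustration:

```python
import json
import queue
import uuid

class FlowServer:
    """Toy stand-in for flow-server: steps 2 and 6-9 of the walkthrough."""
    def __init__(self, party_id, mq_out, mq_in):
        self.party_id, self.mq_out, self.mq_in = party_id, mq_out, mq_in
        self.started = []                       # jobs launched on this side

    def submit(self, job_config):
        job_id = uuid.uuid4().hex               # step 2: randomly generated task id
        dsl = {"job_id": job_id, **job_config}  # inject global config into the DSL
        self.run_locally(dsl)                   # steps 3-5: schedule on K8S/YARN
        self.mq_out.put(json.dumps(dsl))        # step 6: hand config to local MQ
        return job_id

    def on_message(self):
        dsl = json.loads(self.mq_in.get())      # step 8: receive the start request
        self.run_locally(dsl)                   # step 9: same as steps 3-5

    def run_locally(self, dsl):
        self.started.append(dsl["job_id"])      # placeholder for real scheduling

# Two parties; step 7's cross-network sync is just a shared queue here.
a_to_b = queue.Queue()
guest = FlowServer(10000, mq_out=a_to_b, mq_in=queue.Queue())
host = FlowServer(20000, mq_out=queue.Queue(), mq_in=a_to_b)

job_id = guest.submit({"pipeline": "federated-lr"})
host.on_message()
assert guest.started == host.started == [job_id]
```

The point of the design is that the submitting party never calls the remote party's API directly; the same job DSL is replayed on each side from the synchronized message, so every party schedules its half of the task independently.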

The format of the task configuration parameter file is as follows:

parties: [ "10000=guest", "20000=host" ]
common-args:
  spark-master-name: local[*]
  runtime-image: power_fl/runtime:develop
parties-args:
  10000:
    hadoop-config: hadoop
    hadoop-user-name: root
    hdfs-libs-path: hdfs:///fl-runtime-libs
    spark-submit-args: ""
    input: /opt/spark-app/fl-runtime/data/a9a.guest.head
    output: /tmp/a9a.guest.output
  20000:
    hadoop-config: hadoop
    hadoop-user-name: root
    hdfs-libs-path: hdfs:///fl-runtime-libs
    spark-submit-args: ""
    input: /opt/spark-app/fl-runtime/data/a9a.host.head
    output: /data/a9a.host.output

The file above consists of three main parts:

  1. parties: lists the participants of the federated task as an array, specifying each party's id and its role in this task in the format partyId=role.
  2. common-args: specifies the parameters shared by all participants. The configurable parameters must match the spec.arguments.parameters defined in pipeline.yaml.
  3. parties-args: specifies each party's own parameters, including Hadoop configuration, algorithm parameters for the task, and so on.
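As a small worked example of part 1, the partyId=role entries can be parsed into a mapping like this (a hedged sketch; the helper name is made up, and the field format follows the example file above):

```python
def parse_parties(parties):
    """Turn entries like "10000=guest" into a {party_id: role} mapping."""
    result = {}
    for entry in parties:
        party_id, role = entry.split("=", 1)  # split on the first '=' only
        result[int(party_id)] = role
    return result

roles = parse_parties(["10000=guest", "20000=host"])
assert roles == {10000: "guest", 20000: "host"}
```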

Summary

PowerFL consolidates the technology and ecosystem of federated learning from the bottom up across five levels: computing and data resources, computing framework, algorithm protocols, product interaction, and application scenarios. The whole system is built on a k8s cluster in a cloud-native way, makes full use of the YARN cluster's big data ecosystem, and achieves high-performance distributed computing for federated tasks based on Spark on Angel. This article first introduced the overall architecture of PowerFL, including its technology stack, key components, and network topology, and on that basis showed how to deploy PowerFL with one click and how to define federated task flows and submit federated tasks. We hope it helps you get started with federated learning quickly, gain an in-depth understanding of this new privacy-preserving machine learning mechanism, and apply it in fields such as e-commerce, finance, healthcare, education, and urban computing.
