Use KubeFATE to deploy a multi-machine federated learning environment (2)

Table of contents

1. Preparation work

2. Deployment operations

   1. Generate the deployment script file and deploy it (operate on the deployment machine, which is machine A)

   2. Verify whether the deployment is successful

   3. Connectivity verification

3. Simple training and inference

   1. Upload data

   2. Carry out training

   3. View training results

4. Delete deployment


1. Preparation work

  1. Two hosts (physical or virtual machines, running Ubuntu or CentOS 7, with root login allowed)
  2. Install Docker on all hosts
  3. Install Docker-Compose on all hosts
  4. The deployment machine has Internet access, and all hosts can reach each other over the network
  5. The target machines have already pulled the FATE component images

How to install Docker and Docker-Compose, and how to download the FATE images, was covered in the previous article:

https://blog.csdn.net/SAGIRIsagiri/article/details/124105064

The two machines here are both CentOS 7 virtual machines, referred to below as machine A and machine B. Machine A serves as both the deployment machine and a target machine. Machine A's IP address is 192.168.16.129, machine B's is 192.168.16.130, and both are logged in as root.
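
Before starting, a quick sanity check of the prerequisites on both machines can save time later. This is a minimal sketch using only standard Docker and Linux commands; the version numbers you see will depend on your installation:

# docker --version             //confirm Docker is installed
# docker-compose --version     //confirm Docker-Compose is installed
# ping -c 3 192.168.16.130     //from machine A, confirm machine B is reachable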

2. Deployment operations

1. Generate the deployment script file and deploy it (operate on the deployment machine, which is machine A)

//Download the kubefate-docker-compose.tar.gz package for KubeFATE v1.3.0
# curl -OL https://github.com/FederatedAI/KubeFATE/releases/download/v1.3.0/kubefate-docker-compose.tar.gz

# tar -xzf kubefate-docker-compose.tar.gz        //extract the archive

# cd docker-deploy/            //enter the docker-deploy directory
 
# vi parties.conf              //edit the parties.conf configuration file
 
user=root                                   
dir=/data/projects/fate                     
partylist=(10000 9999)          //the IDs of the two clusters            
partyiplist=(192.168.16.129 192.168.16.130)       //the IPs of the two target machines
servingiplist=(192.168.16.129 192.168.16.130)     //the IPs of the two target machines
exchangeip=
 
# bash generate_config.sh          //generate the deployment files
 
# bash docker_deploy.sh all        //run the script that deploys and starts the clusters
//you will be prompted several times for the target machines' root passwords
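
To avoid typing the root passwords repeatedly, one option is to set up key-based SSH login from the deployment machine beforehand. A minimal sketch, assuming OpenSSH is installed on all machines:

# ssh-keygen -t rsa                       //generate a key pair (accepting the defaults is fine)
# ssh-copy-id root@192.168.16.129         //authorize the key on machine A
# ssh-copy-id root@192.168.16.130         //authorize the key on machine B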

2. Verify whether the deployment is successful

Verify on target machines A and B respectively.

# docker ps                    //cluster A (ID 10000)
CONTAINER ID   IMAGE                                      COMMAND                  CREATED          STATUS          PORTS                                                                                            NAMES
6186cc50baa1   federatedai/serving-proxy:1.2.2-release    "/bin/sh -c 'java -D…"   14 minutes ago   Up 12 minutes   0.0.0.0:8059->8059/tcp, :::8059->8059/tcp, 0.0.0.0:8869->8869/tcp, :::8869->8869/tcp, 8879/tcp   serving-10000_serving-proxy_1
870a3048336b   federatedai/serving-server:1.2.2-release   "/bin/sh -c 'java -c…"   14 minutes ago   Up 12 minutes   0.0.0.0:8000->8000/tcp, :::8000->8000/tcp                                                        serving-10000_serving-server_1
9a594365a451   redis:5                                    "docker-entrypoint.s…"   14 minutes ago   Up 12 minutes   6379/tcp                                                                                         serving-10000_redis_1
44a0df69d2b1   federatedai/egg:1.3.0-release              "/bin/sh -c 'cd /dat…"   18 minutes ago   Up 17 minutes   7778/tcp, 7888/tcp, 50000-60000/tcp                                                              confs-10000_egg_1
22fe1f5e1ec1   federatedai/federation:1.3.0-release       "/bin/sh -c 'java -c…"   18 minutes ago   Up 17 minutes   9394/tcp                                                                                         confs-10000_federation_1
f75f0405b4bc   mysql:8                                    "docker-entrypoint.s…"   18 minutes ago   Up 17 minutes   3306/tcp, 33060/tcp                                                                              confs-10000_mysql_1
a503e90b1548   redis:5                                    "docker-entrypoint.s…"   18 minutes ago   Up 17 minutes   6379/tcp                                                                                         confs-10000_redis_1
b09a08468ad3   federatedai/proxy:1.3.0-release            "/bin/sh -c 'java -c…"   18 minutes ago   Up 17 minutes   0.0.0.0:9370->9370/tcp, :::9370->9370/tcp                                                        confs-10000_proxy_1
# docker ps                    //cluster B (ID 9999)
CONTAINER ID   IMAGE                                    COMMAND                  CREATED          STATUS          PORTS                                       NAMES
27262d0be615   federatedai/roll:1.3.0-release           "/bin/sh -c 'java -c…"   10 minutes ago   Up 9 minutes    8011/tcp                                    confs-9999_roll_1
e0b244d55562   federatedai/meta-service:1.3.0-release   "/bin/sh -c 'java -c…"   11 minutes ago   Up 10 minutes   8590/tcp                                    confs-9999_meta-service_1
6e249db9451c   federatedai/egg:1.3.0-release            "/bin/sh -c 'cd /dat…"   12 minutes ago   Up 10 minutes   7778/tcp, 7888/tcp, 50000-60000/tcp         confs-9999_egg_1
8db5215d3998   mysql:8                                  "docker-entrypoint.s…"   12 minutes ago   Up 11 minutes   3306/tcp, 33060/tcp                         confs-9999_mysql_1
d16f4c43fb05   federatedai/proxy:1.3.0-release          "/bin/sh -c 'java -c…"   12 minutes ago   Up 11 minutes   0.0.0.0:9370->9370/tcp, :::9370->9370/tcp   confs-9999_proxy_1
b5062d978a12   federatedai/federation:1.3.0-release     "/bin/sh -c 'java -c…"   12 minutes ago   Up 11 minutes   9394/tcp                                    confs-9999_federation_1
ad673a6e2c4a   redis:5                                  "docker-entrypoint.s…"   12 minutes ago   Up 11 minutes   6379/tcp                                    confs-9999_redis_1
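
A quick way to spot a container that failed to start is to list all containers, including exited ones, and filter out the healthy lines. This uses only standard docker CLI flags:

# docker ps -a --format '{{.Names}}: {{.Status}}' | grep -v Up    //any line printed here is a container that is not running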

3. Connectivity verification

Run the following commands on the deployment machine (machine A):

# docker exec -it confs-10000_python_1 bash        //enter the python container on the deployment machine

# cd /data/projects/fate/python/examples/toy_example  //enter the test-script directory

# python run_toy_example.py 10000 9999 1           //run the test script; the trailing 1 selects multi-machine mode

A successful test prints output like the following:

"2019-08-29 07:21:25,353 - secure_add_guest.py[line:96] - INFO: begin to init parameters of secure add example guest"
"2019-08-29 07:21:25,354 - secure_add_guest.py[line:99] - INFO: begin to make guest data"
"2019-08-29 07:21:26,225 - secure_add_guest.py[line:102] - INFO: split data into two random parts"
"2019-08-29 07:21:29,140 - secure_add_guest.py[line:105] - INFO: share one random part data to host"
"2019-08-29 07:21:29,237 - secure_add_guest.py[line:108] - INFO: get share of one random part data from host"
"2019-08-29 07:21:33,073 - secure_add_guest.py[line:111] - INFO: begin to get sum of guest and host"
"2019-08-29 07:21:33,920 - secure_add_guest.py[line:114] - INFO: receive host sum from guest"
"2019-08-29 07:21:34,118 - secure_add_guest.py[line:121] - INFO: success to calculate secure_sum, it is 2000.0000000000002"

With this, the two-machine FATE federated learning environment is up and running.
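
If the toy example hangs or fails instead, the proxy logs on each side are a reasonable first place to look, since the proxy carries the cross-party traffic. Container names are as shown by docker ps above:

# docker logs --tail 50 confs-10000_proxy_1        //machine A side
# docker logs --tail 50 confs-9999_proxy_1         //machine B side (run on machine B)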

3. Simple training and inference (verify the Serving-Service function)

Use the two deployed FATE clusters for a simple training and inference test. The training data set is "breast", a small test data set that ships with FATE under "examples/data". It is split into two parts, "breast_a" and "breast_b": the host participating in the training holds "breast_a", while the guest holds "breast_b". The guest and host jointly perform logistic regression training on the data set, and the trained model is then pushed to FATE Serving for online inference.

1. Upload data

The following operations are performed on machine A

# docker exec -it confs-10000_python_1 bash            //enter the python container

# cd fate_flow                                         //enter the fate_flow directory

# vi examples/upload_host.json                         //edit the upload configuration file

{
  "file": "examples/data/breast_a.csv",
  "head": 1,
  "partition": 10,
  "work_mode": 1,
  "namespace": "fate_flow_test_breast",
  "table_name": "breast"
}

//upload "breast_a.csv" into the system
# python fate_flow_client.py -f upload -c examples/upload_host.json  
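
The upload command returns a JSON response whose retcode should be 0. As an extra check, the flow client in FATE 1.x also exposes a table_info query; the exact flags may differ between versions, so treat this as a sketch:

# python fate_flow_client.py -f table_info -t breast -n fate_flow_test_breast    //should report the uploaded table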

The following operations are performed on machine B

# docker exec -it confs-9999_python_1 bash             //enter the python container

# cd fate_flow                                         //enter the fate_flow directory

# vi examples/upload_guest.json                        //edit the upload configuration file

{
  "file": "examples/data/breast_b.csv",
  "head": 1,
  "partition": 10,
  "work_mode": 1,
  "namespace": "fate_flow_test_breast",
  "table_name": "breast"
}

//upload "breast_b.csv" into the system
# python fate_flow_client.py -f upload -c examples/upload_guest.json  

2. Carry out training

# vi examples/test_hetero_lr_job_conf.json        //edit the training job configuration file

{
    "initiator": {
        "role": "guest",
        "party_id": 9999
    },
    "job_parameters": {
        "work_mode": 1
    },
    "role": {
        "guest": [9999],
        "host": [10000],
        "arbiter": [10000]
    },
    "role_parameters": {
        "guest": {
            "args": {
                "data": {
                    "train_data": [{"name": "breast", "namespace": "fate_flow_test_breast"}]
                }
            },
            "dataio_0":{
                "with_label": [true],
                "label_name": ["y"],
                "label_type": ["int"],
                "output_format": ["dense"]
            }
        },
        "host": {
            "args": {
                "data": {
                    "train_data": [{"name": "breast", "namespace": "fate_flow_test_breast"}]
                }
            },
             "dataio_0":{
                "with_label": [false],
                "output_format": ["dense"]
            }
        }
    },
    ....
}
//submit the job to train on the uploaded data sets
# python fate_flow_client.py -f submit_job -d examples/test_hetero_lr_job_dsl.json -c examples/test_hetero_lr_job_conf.json

//output
{
    "data": {
        "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=2022041901241226828821&role=guest&party_id=9999",
        "job_dsl_path": "/data/projects/fate/python/jobs/2022041901241226828821/job_dsl.json",
        "job_runtime_conf_path": "/data/projects/fate/python/jobs/2022041901241226828821/job_runtime_conf.json",
        "logs_directory": "/data/projects/fate/python/logs/2022041901241226828821",
        "model_info": {
            "model_id": "arbiter-10000#guest-9999#host-10000#model",
            "model_version": "2022041901241226828821"
        }
    },
    "jobId": "2022041901241226828821",
    "retcode": 0,
    "retmsg": "success"
}

//check the training progress with the command below until every task reports success; the value after -j is the jobId returned above
# python fate_flow_client.py -f query_task -j 2022041901241226828821 | grep f_status
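
Training takes a few minutes. Rather than rerunning the query by hand, one option is to poll it, assuming the watch utility is available where you run the client:

# watch -n 10 'python fate_flow_client.py -f query_task -j 2022041901241226828821 | grep f_status'    //refresh the status every 10 seconds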

3. View training results

Open 127.0.0.1:8080 in a browser to reach FATEBoard and view the visualized results of the training task.
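
If your browser is not running on machine A itself, 127.0.0.1:8080 will not reach FATEBoard directly. One option is standard OpenSSH local port forwarding from your workstation:

# ssh -L 8080:localhost:8080 root@192.168.16.129    //then browse to http://127.0.0.1:8080 locally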

4. Delete deployment

If the deployment needs to be removed, all FATE clusters can be stopped by running the following command on the deployment machine:

# bash docker_deploy.sh --delete all

To completely remove FATE from the target machines, log in to each node and run:

# cd /data/projects/fate/confs-<id>/                //<id> is the cluster ID (10000 or 9999)
# docker-compose down
# rm -rf ../confs-<id>/
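
Afterwards, a quick check that nothing was left behind (standard docker CLI); if any serving containers are still listed, stop them from their serving-<id> directory in the same way:

# docker ps -a | grep -E 'confs-|serving-'    //should print nothing once cleanup is complete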
