Deploying a Multi-Host Federated Learning Environment with KubeFATE (Part 2)

Contents

I. Preparation

II. Deployment

1. Generate the deployment scripts and deploy (on the deployment machine, i.e. machine A)

2. Verify the deployment

3. Verify connectivity

III. Simple Training and Inference

1. Upload data

2. Run training

3. View the training results

IV. Removing the Deployment


I. Preparation

  1. Two hosts (physical or virtual machines running Ubuntu or CentOS 7, with root login allowed)
  2. Docker installed on all hosts
  3. Docker-Compose installed on all hosts
  4. The deployment machine has Internet access, and all hosts can reach each other over the network
  5. The target machines have already pulled the FATE component images

How to install Docker and Docker-Compose, and how to pull the FATE images, was covered in the previous post:

https://blog.csdn.net/SAGIRIsagiri/article/details/124105064

Here both hosts are CentOS 7 virtual machines, referred to below as machine A and machine B. Machine A serves as both the deployment machine and a target machine. Machine A's IP address is 192.168.16.129, machine B's is 192.168.16.130, and both are logged in as root.
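One thing worth checking up front (a tip from my own setup, not a step in the original guide): on CentOS 7, firewalld may block the ports the two FATE clusters use to talk to each other, such as the proxy port 9370 that appears in the docker ps output below. If the connectivity test in section II fails, open the port on both machines, or simply stop the firewall on an isolated test network:

// Open the FATE proxy port (assumes firewalld is running)
# firewall-cmd --permanent --add-port=9370/tcp
# firewall-cmd --reload

// Or, on an isolated test network, disable the firewall entirely
# systemctl stop firewalld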

II. Deployment

1. Generate the deployment scripts and deploy (on the deployment machine, i.e. machine A)

// Download the kubefate-docker-compose.tar.gz package from the KubeFATE v1.3.0 release
# curl -OL https://github.com/FederatedAI/KubeFATE/releases/download/v1.3.0/kubefate-docker-compose.tar.gz

# tar -xzf kubefate-docker-compose.tar.gz        // unpack the archive

# cd docker-deploy/            // enter the docker-deploy directory
 
# vi parties.conf              // edit the parties.conf configuration file
 
user=root
dir=/data/projects/fate
partylist=(10000 9999)          // the IDs of the two clusters
partyiplist=(192.168.16.129 192.168.16.130)       // the IPs of the two target machines
servingiplist=(192.168.16.129 192.168.16.130)     // the IPs of the two target machines
exchangeip=                     // left empty here; no exchange node is used
 
# bash generate_config.sh          // generate the deployment files
 
# bash docker_deploy.sh all        // deploy and start all clusters
// you will be prompted for the target machines' root password several times
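To avoid typing the root password over and over (a convenience of my own, not part of the original steps), you can set up key-based SSH from machine A to both target machines before running the deploy script:

// Generate a key pair on machine A (press Enter through the prompts)
# ssh-keygen -t rsa

// Copy the public key to both target machines; machine A deploys to itself too
# ssh-copy-id root@192.168.16.129
# ssh-copy-id root@192.168.16.130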

2. Verify the deployment

Verify on target machines A and B respectively.

# docker ps                    // cluster A (ID 10000)
CONTAINER ID   IMAGE                                      COMMAND                  CREATED          STATUS          PORTS                                                                                            NAMES
6186cc50baa1   federatedai/serving-proxy:1.2.2-release    "/bin/sh -c 'java -D…"   14 minutes ago   Up 12 minutes   0.0.0.0:8059->8059/tcp, :::8059->8059/tcp, 0.0.0.0:8869->8869/tcp, :::8869->8869/tcp, 8879/tcp   serving-10000_serving-proxy_1
870a3048336b   federatedai/serving-server:1.2.2-release   "/bin/sh -c 'java -c…"   14 minutes ago   Up 12 minutes   0.0.0.0:8000->8000/tcp, :::8000->8000/tcp                                                        serving-10000_serving-server_1
9a594365a451   redis:5                                    "docker-entrypoint.s…"   14 minutes ago   Up 12 minutes   6379/tcp                                                                                         serving-10000_redis_1
44a0df69d2b1   federatedai/egg:1.3.0-release              "/bin/sh -c 'cd /dat…"   18 minutes ago   Up 17 minutes   7778/tcp, 7888/tcp, 50000-60000/tcp                                                              confs-10000_egg_1
22fe1f5e1ec1   federatedai/federation:1.3.0-release       "/bin/sh -c 'java -c…"   18 minutes ago   Up 17 minutes   9394/tcp                                                                                         confs-10000_federation_1
f75f0405b4bc   mysql:8                                    "docker-entrypoint.s…"   18 minutes ago   Up 17 minutes   3306/tcp, 33060/tcp                                                                              confs-10000_mysql_1
a503e90b1548   redis:5                                    "docker-entrypoint.s…"   18 minutes ago   Up 17 minutes   6379/tcp                                                                                         confs-10000_redis_1
b09a08468ad3   federatedai/proxy:1.3.0-release            "/bin/sh -c 'java -c…"   18 minutes ago   Up 17 minutes   0.0.0.0:9370->9370/tcp, :::9370->9370/tcp                                                        confs-10000_proxy_1
# docker ps                    // cluster B (ID 9999)
CONTAINER ID   IMAGE                                    COMMAND                  CREATED          STATUS          PORTS                                       NAMES
27262d0be615   federatedai/roll:1.3.0-release           "/bin/sh -c 'java -c…"   10 minutes ago   Up 9 minutes    8011/tcp                                    confs-9999_roll_1
e0b244d55562   federatedai/meta-service:1.3.0-release   "/bin/sh -c 'java -c…"   11 minutes ago   Up 10 minutes   8590/tcp                                    confs-9999_meta-service_1
6e249db9451c   federatedai/egg:1.3.0-release            "/bin/sh -c 'cd /dat…"   12 minutes ago   Up 10 minutes   7778/tcp, 7888/tcp, 50000-60000/tcp         confs-9999_egg_1
8db5215d3998   mysql:8                                  "docker-entrypoint.s…"   12 minutes ago   Up 11 minutes   3306/tcp, 33060/tcp                         confs-9999_mysql_1
d16f4c43fb05   federatedai/proxy:1.3.0-release          "/bin/sh -c 'java -c…"   12 minutes ago   Up 11 minutes   0.0.0.0:9370->9370/tcp, :::9370->9370/tcp   confs-9999_proxy_1
b5062d978a12   federatedai/federation:1.3.0-release     "/bin/sh -c 'java -c…"   12 minutes ago   Up 11 minutes   9394/tcp                                    confs-9999_federation_1
ad673a6e2c4a   redis:5                                  "docker-entrypoint.s…"   12 minutes ago   Up 11 minutes   6379/tcp                                    confs-9999_redis_1
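If you just want to confirm every container is healthy without reading the whole table, a quick check (my own shorthand, not from the guide) is to print only names and status; every line should show "Up":

// Print container names and status only
# docker ps --format '{{.Names}}\t{{.Status}}'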

3. Verify connectivity

Run the following on the deployment machine (machine A):

# docker exec -it confs-10000_python_1 bash        // enter the python container on the deployment machine

# cd /data/projects/fate/python/examples/toy_example  // enter the test-script directory

# python run_toy_example.py 10000 9999 1           // run the test script; the trailing 1 selects cluster (multi-host) mode

If the test succeeds, it prints output like the following:

"2019-08-29 07:21:25,353 - secure_add_guest.py[line:96] - INFO: begin to init parameters of secure add example guest"
"2019-08-29 07:21:25,354 - secure_add_guest.py[line:99] - INFO: begin to make guest data"
"2019-08-29 07:21:26,225 - secure_add_guest.py[line:102] - INFO: split data into two random parts"
"2019-08-29 07:21:29,140 - secure_add_guest.py[line:105] - INFO: share one random part data to host"
"2019-08-29 07:21:29,237 - secure_add_guest.py[line:108] - INFO: get share of one random part data from host"
"2019-08-29 07:21:33,073 - secure_add_guest.py[line:111] - INFO: begin to get sum of guest and host"
"2019-08-29 07:21:33,920 - secure_add_guest.py[line:114] - INFO: receive host sum from guest"
"2019-08-29 07:21:34,118 - secure_add_guest.py[line:121] - INFO: success to calculate secure_sum, it is 2000.0000000000002"

With that, the FATE federated learning environment across the two machines is complete.

III. Simple Training and Inference

We now use the two deployed FATE clusters for a simple training and inference test. The training data is "breast", a small test dataset that ships with FATE under "examples/data". It is split into two parts, "breast_a" and "breast_b": the host party holds "breast_a" and the guest party holds "breast_b". Guest and host jointly train a logistic regression model on the dataset, and the finished model is then pushed to FATE Serving for online inference.

1. Upload data

Perform the following on machine A:

# docker exec -it confs-10000_python_1 bash            // enter the python container

# cd fate_flow                                         // enter the fate_flow directory

# vi examples/upload_host.json                         // edit the upload configuration file

{
  "file": "examples/data/breast_a.csv",
  "head": 1,
  "partition": 10,
  "work_mode": 1,
  "namespace": "fate_flow_test_breast",
  "table_name": "breast"
}

// upload "breast_a.csv" into the system
# python fate_flow_client.py -f upload -c examples/upload_host.json  
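A few notes on these fields: "head": 1 marks the first CSV row as a header, "partition" is the number of storage partitions for the table, and "work_mode": 1 selects cluster (multi-host) mode. To double-check that the table was created, fate_flow_client has a table_info query; the flags below are from memory for FATE 1.3, so verify with python fate_flow_client.py -h if it errors out:

// Query the uploaded table by name and namespace (assumed syntax)
# python fate_flow_client.py -f table_info -t breast -n fate_flow_test_breast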

Perform the following on machine B:

# docker exec -it confs-9999_python_1 bash             // enter the python container

# cd fate_flow                                         // enter the fate_flow directory

# vi examples/upload_guest.json                        // edit the upload configuration file

{
  "file": "examples/data/breast_b.csv",
  "head": 1,
  "partition": 10,
  "work_mode": 1,
  "namespace": "fate_flow_test_breast",
  "table_name": "breast"
}

// upload "breast_b.csv" into the system
# python fate_flow_client.py -f upload -c examples/upload_guest.json  

2. Run training

# vi examples/test_hetero_lr_job_conf.json        // edit the training job configuration file

{
    "initiator": {
        "role": "guest",
        "party_id": 9999
    },
    "job_parameters": {
        "work_mode": 1
    },
    "role": {
        "guest": [9999],
        "host": [10000],
        "arbiter": [10000]
    },
    "role_parameters": {
        "guest": {
            "args": {
                "data": {
                    "train_data": [{"name": "breast", "namespace": "fate_flow_test_breast"}]
                }
            },
            "dataio_0":{
                "with_label": [true],
                "label_name": ["y"],
                "label_type": ["int"],
                "output_format": ["dense"]
            }
        },
        "host": {
            "args": {
                "data": {
                    "train_data": [{"name": "breast", "namespace": "fate_flow_test_breast"}]
                }
            },
             "dataio_0":{
                "with_label": [false],
                "output_format": ["dense"]
            }
        }
    },
    ....
}
// submit the job to train on the uploaded datasets
# python fate_flow_client.py -f submit_job -d examples/test_hetero_lr_job_dsl.json -c examples/test_hetero_lr_job_conf.json

// output:
{
    "data": {
        "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=2022041901241226828821&role=guest&party_id=9999",
        "job_dsl_path": "/data/projects/fate/python/jobs/2022041901241226828821/job_dsl.json",
        "job_runtime_conf_path": "/data/projects/fate/python/jobs/2022041901241226828821/job_runtime_conf.json",
        "logs_directory": "/data/projects/fate/python/logs/2022041901241226828821",
        "model_info": {
            "model_id": "arbiter-10000#guest-9999#host-10000#model",
            "model_version": "2022041901241226828821"
        }
    },
    "jobId": "2022041901241226828821",
    "retcode": 0,
    "retmsg": "success"
}

// check training progress until every task reports success; the value after -j is the jobId returned above
# python fate_flow_client.py -f query_task -j 2022041901241226828821 | grep f_status
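Rather than re-running that by hand, a small shell loop (my own wrapper around the same query_task call) polls until no task is left in a non-success state; note it would also spin forever on a failed task, so check manually if it runs unusually long:

// Poll every 10 seconds until all f_status lines read "success"
# while python fate_flow_client.py -f query_task -j 2022041901241226828821 | grep f_status | grep -qv success; do sleep 10; done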

3. View the training results

Open 127.0.0.1:8080 in a browser to reach FATE Board and view the visualized training results (from a machine other than the one running the job, replace 127.0.0.1 with that host's IP).

IV. Removing the Deployment

To tear the deployment down, run the following on the deployment machine to stop all FATE clusters:

# bash docker_deploy.sh --delete all

To remove FATE from a target machine completely, log in to that node and run:

# cd /data/projects/fate/confs-<id>/                // <id> is the cluster ID
# docker-compose down
# rm -rf ../confs-<id>/
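If you also want to reclaim the disk space taken by the downloaded images (optional; the repository name matches the images seen in the docker ps output above), remove them as well:

// List the FATE images, then delete them by ID
# docker images 'federatedai/*'
# docker rmi $(docker images 'federatedai/*' -q)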
