Unlock Cloud Native AI Skills: Developing Your Machine Learning Workflow

 

After setting up Kubeflow Pipelines as described in the previous article, "Unlock Cloud Native AI Skills: Build a Machine Learning System on Kubernetes", we can now put it to the test: through a real case, we will learn how to develop a machine learning workflow based on Kubeflow Pipelines.

Preparation

A machine learning workflow is a task-driven process as well as a data-driven one: it involves importing and preparing data, exporting and evaluating checkpoints during model training, and exporting the final model. This requires distributed storage as the medium for passing data between steps; here we use NAS as the distributed storage.

  • Create distributed storage; here we take NAS as an example. Replace NFS_SERVER_IP below with the address of your real NAS server.
  1. To create an Alibaba Cloud NAS service, refer to the documentation.
  2. Create a /data directory on the NFS server:
# mkdir -p /nfs
# mount -t nfs -o vers=4.0 NFS_SERVER_IP:/ /nfs
# mkdir -p /nfs/data
# cd /
# umount /nfs

 

  3. Create the corresponding Persistent Volume:
# cat nfs-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: user-susan
  labels:
    user-susan: pipelines
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteMany
  nfs:
    server: NFS_SERVER_IP
    path: "/data"
    
# kubectl create -f nfs-pv.yaml
  4. Create the corresponding Persistent Volume Claim:
# cat nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: user-susan
  annotations:
    description: "this is the mnist demo"
    owner: Tom
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
       storage: 5Gi
  selector:
    matchLabels:
      user-susan: pipelines
# kubectl create -f nfs-pvc.yaml

 

Developing the Pipeline

Because the examples provided by Kubeflow Pipelines depend on Google's storage services, users in China cannot really experience the capabilities of Pipelines. To address this, the Alibaba Cloud container services team provides an example of training an MNIST model based on NAS storage, making it easy to use and learn Kubeflow Pipelines on Alibaba Cloud. It consists of three steps:

  • (1) Download the data
  • (2) Train the model with TensorFlow
  • (3) Export the model

Each of these three steps depends on the completion of the previous one.

In Kubeflow Pipelines, such a process can be described with Python code; the complete code can be viewed in standalone_pipeline.py.

In this example we use arena_op, based on the open source project Arena, which is a wrapper around Kubeflow's default container_op. It offers seamless support for the MPI and PS modes of distributed training, simple access to heterogeneous devices such as GPUs and RDMA and to distributed storage, and convenient code synchronization from git sources, making it a fairly practical API tool.

@dsl.pipeline(
  name='pipeline to run jobs',
  description='shows how to run pipeline jobs.'
)
def sample_pipeline(learning_rate='0.01',
    dropout='0.9',
    model_version='1',
    commit='f097575656f927d86d99dd64931042e1a9003cb2'):
  """A pipeline for end to end machine learning workflow."""
  data=["user-susan:/training"]
  gpus=1
# 1. prepare data
  prepare_data = arena.standalone_job_op(
    name="prepare-data",
    image="byrnedo/alpine-curl",
    data=data,
    command="mkdir -p /training/dataset/mnist && \
  cd /training/dataset/mnist && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-images-idx3-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-labels-idx1-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-images-idx3-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-labels-idx1-ubyte.gz")
  # 2. download the source code and train the model
  train = arena.standalone_job_op(
    name="train",
    image="tensorflow/tensorflow:1.11.0-gpu-py3",
    sync_source="https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git",
    env=["GIT_SYNC_REV=%s" % (commit)],
    gpus=gpus,
    data=data,
    command='''
    echo %s;python code/tensorflow-sample-code/tfjob/docker/mnist/main.py \
    --max_steps 500 --data_dir /training/dataset/mnist \
    --log_dir /training/output/mnist  --learning_rate %s \
    --dropout %s''' % (prepare_data.output, learning_rate, dropout),
    metrics=["Train-accuracy:PERCENTAGE"])
  # 3. export the model
  export_model = arena.standalone_job_op(
    name="export-model",
    image="tensorflow/tensorflow:1.11.0-py3",
    sync_source="https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git",
    env=["GIT_SYNC_REV=%s" % (commit)],
    data=data,
    command="echo %s;python code/tensorflow-sample-code/tfjob/docker/mnist/export_model.py --model_version=%s --checkpoint_path=/training/output/mnist /training/output/models" % (train.output, model_version))

 

Kubeflow Pipelines transforms the above code into a directed acyclic graph (DAG), where each node is a Component, and the connections between Components represent the dependencies between them. The DAG can be seen in the Pipelines UI:

First, let's look at the data preparation part in detail. Here we provide the Python API arena.standalone_job_op, for which you need to specify: the name of this step (name), the container image to use (image), and the data to use along with the directory it is mounted to inside the container (data).

Here data is in array format, e.g. data=["user-susan:/training"], meaning that multiple data volumes can be mounted. user-susan is the Persistent Volume Claim created earlier, and /training is the mount directory inside the container.

prepare_data = arena.standalone_job_op(
    name="prepare-data",
    image="byrnedo/alpine-curl",
    data=data,
    command="mkdir -p /training/dataset/mnist && \
  cd /training/dataset/mnist && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-images-idx3-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-labels-idx1-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-images-idx3-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-labels-idx1-ubyte.gz")

 

The step above uses curl to download the data from the specified addresses into the directory /training/dataset/mnist on the distributed storage. Note that /training is the root directory of the distributed storage, similar to a familiar root mount point, while /training/dataset/mnist is a subdirectory. The later steps can read the data and compute on it through the same root mount point.

The second step uses the data downloaded to the distributed storage, downloads the code from git pinned to a fixed commit id, and trains the model.

train = arena.standalone_job_op(
    name="train",
    image="tensorflow/tensorflow:1.11.0-gpu-py3",
    sync_source="https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git",
    env=["GIT_SYNC_REV=%s" % (commit)],
    gpus=gpus,
    data=data,
    command='''
    echo %s;python code/tensorflow-sample-code/tfjob/docker/mnist/main.py \
    --max_steps 500 --data_dir /training/dataset/mnist \
    --log_dir /training/output/mnist  --learning_rate %s \
    --dropout %s''' % (prepare_data.output, learning_rate, dropout),
    metrics=["Train-accuracy:PERCENTAGE"])

 

As you can see, this step is a bit more complex than data preparation. Besides name, image, data and command, which need to be specified as in the first step, the model training step also requires:

  • How to obtain the code: from the perspective of reproducible experiments, being able to trace the code of a run back to its origin is very important. When calling the API, you can specify the git code source with sync_source and pin the commit id of the training code by setting GIT_SYNC_REV in env;
  • gpus: defaults to 0, i.e. no GPU is used; an integer greater than 0 indicates the number of GPUs required by this step;
  • metrics: again for the purpose of reproducible and comparable experiments, the user can export a set of metrics and display and compare them visually in the Pipelines UI. It is used in two steps: 1. when calling the API, specify the names of the metrics to collect and the display format of each metric, either RAW or PERCENTAGE, as an array, for example metrics=["Train-accuracy:PERCENTAGE"]; 2. since Pipelines collects metrics from stdout logs by default, the model code that actually runs needs to print {metrics name}={value} or {metrics name}:{value}, as in the short sketch after this list; also refer to the sample code.
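As a minimal sketch of that output format (the accuracy value and the print statement here are illustrative assumptions, not the actual sample code), the training script only needs something like:

# Illustrative sketch: emit a metric to stdout in the format Pipelines scrapes by default.
# The metric name must match the one declared in metrics=["Train-accuracy:PERCENTAGE"].
accuracy = 0.97  # assumed to have been computed during evaluation
print("Train-accuracy=%s" % accuracy)  # "Train-accuracy:%s" % accuracy would also be accepted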

It is worth noting:

This step specifies the same data parameter as prepare_data, ["user-susan:/training"], so the training code can read the corresponding data, for example via --data_dir /training/dataset/mnist.

In addition, since this step depends on prepare_data, referencing prepare_data.output in the method expresses the dependency between the two steps.
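For comparison, here is a minimal sketch (not part of the article's example, just an assumption using the native API) of how the same ordering could be declared explicitly with dsl.ContainerOp:

from kfp import dsl

@dsl.pipeline(name='dependency-demo', description='explicit step ordering with ContainerOp')
def dependency_demo():
    # Two trivial steps; step2 declares an explicit dependency on step1.
    step1 = dsl.ContainerOp(name='step1', image='alpine', command=['echo', 'prepare'])
    step2 = dsl.ContainerOp(name='step2', image='alpine', command=['echo', 'train'])
    step2.after(step1)  # run step2 only after step1 completes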

Finally, export_model generates the trained model based on the checkpoint produced by train:

export_model = arena.standalone_job_op(
    name="export-model",
    image="tensorflow/tensorflow:1.11.0-py3",
    sync_source="https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git",
    env=["GIT_SYNC_REV=%s" % (commit)],
    data=data,
    command="echo %s;python code/tensorflow-sample-code/tfjob/docker/mnist/export_model.py --model_version=%s --checkpoint_path=/training/output/mnist /training/output/models" % (train.output, model_version))

 

export_model is similar to the second step, train, and even simpler: it just syncs the model export code from git and uses the checkpoint in the shared directory /training/output/mnist to export the model.

The whole workflow is still quite intuitive; now you can define a Python method to tie the whole process together:

@dsl.pipeline(
  name='pipeline to run jobs',
  description='shows how to run pipeline jobs.'
)
def sample_pipeline(learning_rate='0.01',
    dropout='0.9',
    model_version='1',
    commit='f097575656f927d86d99dd64931042e1a9003cb2'):

 

@dsl.pipeline is the decorator marking a workflow; it requires two attributes, name and description.

The entry method sample_pipeline defines four parameters: learning_rate, dropout, model_version and commit, which are used in the train and export_model stages above. The parameter values are actually of type dsl.PipelineParam; they are defined as dsl.PipelineParam so that Kubeflow Pipelines can turn them into an input form in its native UI, where the form key is the parameter name and the default value is the parameter's value. Note that a dsl.PipelineParam's value can only be a string or a number; arrays, maps and custom types cannot be converted this way.

In fact, these parameter values can be overridden when the user submits the workflow; below is the corresponding UI for submitting a workflow:

Submitting the Pipeline

You can submit the workflow developed in the Python DSL above to the Kubeflow Pipelines service inside your own Kubernetes cluster; the actual submission code is quite simple:

import kfp
import kfp.compiler as compiler

# EXPERIMENT_NAME, RUN_ID and the parameter values are defined elsewhere in the full standalone_pipeline.py.
KFP_SERVICE = "ml-pipeline.kubeflow.svc.cluster.local:8888"
compiler.Compiler().compile(sample_pipeline, __file__ + '.tar.gz')
client = kfp.Client(host=KFP_SERVICE)
try:
    experiment_id = client.get_experiment(experiment_name=EXPERIMENT_NAME).id
except:
    experiment_id = client.create_experiment(EXPERIMENT_NAME).id
run = client.run_pipeline(experiment_id, RUN_ID, __file__ + '.tar.gz',
                          params={'learning_rate': learning_rate,
                                  'dropout': dropout,
                                  'model_version': model_version,
                                  'commit': commit})

 

compiler.compile compiles the Python code into a DAG configuration file that the execution engine (Argo) can recognize;

the Kubeflow Pipelines client then creates (or finds an existing) experiment and submits the previously compiled DAG configuration file.

Prepare a Python 3 environment inside the cluster and install the Kubeflow Pipelines SDK:

# kubectl create job pipeline-client --namespace kubeflow --image python:3 -- sleep infinity
# kubectl  exec -it -n kubeflow $(kubectl get po -l job-name=pipeline-client -n kubeflow | grep -v NAME| awk '{print $1}') bash

 

After logging in to the Python 3 environment, execute the following commands to submit two tasks with different parameters in succession:

# pip3 install http://kubeflow.oss-cn-beijing.aliyuncs.com/kfp/0.1.14/kfp.tar.gz --upgrade
# pip3 install http://kubeflow.oss-cn-beijing.aliyuncs.com/kfp-arena/kfp-arena-0.4.tar.gz --upgrade
# curl -O https://raw.githubusercontent.com/cheyang/pipelines/update_standalone_sample/samples/arena-samples/standalonejob/standalone_pipeline.py
# python3 standalone_pipeline.py --learning_rate 0.0001 --dropout 0.8 --model_version 2
# python3 standalone_pipeline.py --learning_rate 0.0005 --dropout 0.8 --model_version 3
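The flags passed above end up as the params of run_pipeline. A rough sketch of how such flag handling might look (the real parsing lives in standalone_pipeline.py; the argparse code here is an assumption for illustration):

import argparse

# Illustrative sketch: map CLI flags onto the pipeline parameters submitted via run_pipeline.
parser = argparse.ArgumentParser(description='Submit the MNIST standalone pipeline')
parser.add_argument('--learning_rate', default='0.01')
parser.add_argument('--dropout', default='0.9')
parser.add_argument('--model_version', default='1')
parser.add_argument('--commit', default='f097575656f927d86d99dd64931042e1a9003cb2')
args = parser.parse_args()

# These values are then passed to client.run_pipeline(..., params={...}) as shown earlier.
learning_rate = args.learning_rate
dropout = args.dropout
model_version = args.model_version
commit = args.commit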

 

Viewing the results

Log in to the Kubeflow Pipelines UI at https://{pipeline address}/pipeline/#/experiments, for example:

https://11.124.285.171/pipeline/#/experiments

Click the Compare runs button to compare the two runs across a series of metrics such as their inputs, elapsed time and accuracy. Making experiments traceable is the first step towards making them reproducible, and using Kubeflow Pipelines' own experiment management capabilities is the first step towards that traceability.

Summary

The steps needed to implement a runnable Kubeflow Pipeline are:

  1. Build the minimum execution units, the Components, needed in the Pipeline. If you use the natively defined dsl.container_op, you need to build two parts of code:
  • Build the runtime code: usually a container image is built for each step, serving as the adapter between Pipelines and the code that actually executes the business logic. What it does is obtain the input parameters from the Pipelines context, call the business logic code, and place the output that needs to be passed to the next step in the specified location inside the container according to Pipelines' rules, so that the underlying workflow component can pass it on. The result is that the runtime code and the business logic code end up coupled together. You can refer to the Kubeflow Pipelines examples;
  • Build the client code: this step usually looks like the following; those familiar with Kubernetes will notice that it is essentially writing a Pod Spec:
from kfp import dsl
from kubernetes import client as k8s_client

# name, input_dir, output_dir, model_name, model_version, epochs,
# persistent_volume_path and persistent_volume_name are supplied by the caller.
container_op = dsl.ContainerOp(
        name=name,
        image='<train-image>',
        arguments=[
            '--input_dir', input_dir,
            '--output_dir', output_dir,
            '--model_name', model_name,
            '--model_version', model_version,
            '--epochs', epochs
        ],
        file_outputs={'output': '/output.txt'}
    )
container_op.add_volume(k8s_client.V1Volume(
            host_path=k8s_client.V1HostPathVolumeSource(
                path=persistent_volume_path),
            name=persistent_volume_name))
container_op.add_volume_mount(k8s_client.V1VolumeMount(
            mount_path=persistent_volume_path,
            name=persistent_volume_name))

 

The advantage of using the natively defined dsl.container_op is flexibility: since the interaction interface with Pipelines is open, the user can do many things at the container_op level. But its problems are:

  • Low reusability: every Component requires building an image and developing runtime code;
  • High complexity: users need to understand Kubernetes concepts such as resource limits, PVCs, node selectors and so on;
  • Difficulty supporting distributed training: since container_op operates on a single container, supporting distributed training means submitting and managing tasks like TFJob from within the container_op. This brings the double challenge of complexity and security: the complexity is easy to understand, and as for security, the permission to submit tasks such as TFJob would have to be granted to the Pipelines developer as an additional privilege.

The other way is to use arena_op, a reusable Component API: it uses common runtime code, sparing you the repeated work of building runtime code; a single common set of arena_op APIs also simplifies usage; and it supports scenarios such as Parameter Server and MPI. We recommend compiling Pipelines this way.

  2. Assemble the built Components into a Pipeline;
  3. Compile the Pipeline into a DAG configuration file that the execution engine (Argo) can recognize, submit the DAG configuration file to Kubeflow Pipelines, and use the Kubeflow Pipelines UI to view the results of the process.


Source: www.cnblogs.com/alisystemsoftware/p/11271924.html