Kubeflow survey

Kubeflow basic concepts

Kubeflow is a Google-led framework that integrates machine learning workflows with Kubernetes. It helps machine learning tasks run well in cloud environments, perform distributed processing, and scale out to large numbers of machines; it is also portable across platforms and lets you observe how a model behaves once it is running.

Things that Kubeflow can do include:

  • data preparation
  • model training
  • prediction serving
  • service management

The machine learning workflow is divided into two stages: the development process and the production process.

Figure 1. Development process and production process

Kubeflow has the following concepts:

  • Pipeline - a machine learning workflow that executes a series of computation steps; it is composed of multiple Components

  • Component - a single computation task in the workflow, roughly equivalent to a Python function, with fixed inputs and outputs; Components can depend on one another

  • Experiment - a configuration environment for workflows, i.e. a set of execution parameters

  • Run - a single execution of a Pipeline in an Experiment environment

  • Recurring Run - a Run that is executed repeatedly at a fixed interval, also known as a Job

  • Step - the execution of one Component within a Run

  • Artifact - an input or output data set of a Step
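
To make the Experiment, Run, and Recurring Run concepts above concrete, here is a minimal sketch using the kfp SDK client (kfp v1). The endpoint URL, experiment and run names, the pre-compiled package pipeline.yaml, and the parameter values are assumptions for illustration, not part of the original text.

import kfp

# Connect to the Kubeflow Pipelines API (hypothetical in-cluster endpoint)
client = kfp.Client(host='http://ml-pipeline-ui.kubeflow:80')

# Experiment: a named configuration environment that groups runs
experiment = client.create_experiment(name='demo-experiment')

# Run: one execution of a compiled Pipeline package inside the Experiment
run = client.run_pipeline(
    experiment_id=experiment.id,
    job_name='demo-run',
    pipeline_package_path='pipeline.yaml',   # assumed pre-compiled package
    params={'rounds': 30},
)

# Recurring Run (Job): the same Pipeline executed on a schedule
client.create_recurring_run(
    experiment_id=experiment.id,
    job_name='demo-nightly',
    cron_expression='0 0 2 * * *',           # every day at 02:00
    pipeline_package_path='pipeline.yaml',
)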

Insert picture description here

Figure 2. Kubeflow overall design

Create Kubeflow Component

In Kubeflow's design, each Component is a Python function that is packaged into a Docker container. Multiple Components form a Pipeline, which is submitted to Kubernetes for execution; the requested computing resources are allocated on demand, and the Kubeflow Pipelines server manages the workflow. Input and output data are referenced by S3 paths, which the system loads automatically. Run records, Pipeline configurations, and run results can be viewed in the Kubeflow UI, where new Runs can also be created.

Each Component is a specific computing task, and multiple machine learning frameworks are supported, such as TensorFlow, PyTorch, MXNet, and MPI. In addition to being defined in YAML files, Pipelines can also be created dynamically in Python scripts or in a Jupyter Notebook. Besides one-off Pipeline executions, Kubeflow also supports deploying a trained model as a service (Serving) and monitoring the status of that service. Task dependencies are managed by Argo, and each kind of computing task has a corresponding Operator that controls scheduling and resource allocation on the underlying Kubernetes cluster. The entire system can run on different cloud platforms.
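
As a minimal sketch of the "Python function packaged into a container" idea, the kfp SDK's lightweight components can turn a plain function into a Component (kfp v1). The function body, pipeline name, and memory request below are illustrative assumptions rather than anything from the original text.

from kfp import dsl
from kfp.components import func_to_container_op


@func_to_container_op
def add(a: float, b: float) -> float:
    """Runs inside a container built from the SDK's default base image."""
    return a + b


@dsl.pipeline(name='add-pipeline', description='toy two-step pipeline')
def add_pipeline(x: float = 1.0, y: float = 2.0):
    first = add(x, y)                              # each call becomes a Step in the Run
    second = add(first.output, 3.0)                # dependency is inferred from the output
    second.container.set_memory_request('512Mi')   # per-step resource request passed to Kubernetes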

Defined in YAML

Code 1. The definition of a Component:

name: xgboost4j - Train classifier
description: Trains a boosted tree ensemble classifier using xgboost4j

inputs:
- {name: Training data}
- {name: Rounds, type: Integer, default: '30', help: Number of training rounds}

outputs:
- {name: Trained model, type: XGBoost model, help: Trained XGBoost model}

implementation:
  container:
    image: gcr.io/ml-pipeline/xgboost-classifier-train@sha256:b3a64d57
    command: [
      /ml/train.py,
      --train-set, {inputPath: Training data},
      --rounds,    {inputValue: Rounds},
      --out-model, {outputPath: Trained model},
    ]

  • name - the name of the Component

  • description - a description of the task

  • inputs - the list of input parameters; each can define a name, type, default value, and so on

  • outputs - the list of output parameters

  • implementation - the description of the computing task: the Docker image, the startup command, and the template parameters (inputPath, inputValue, outputPath) that the system substitutes at run time
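
As a rough sketch of how such a YAML definition is typically consumed (assuming the definition above is saved as component.yaml; the file name, pipeline name, and default values are assumptions), the kfp SDK can load it and use it as a step in a Pipeline:

from kfp import dsl, components

# Load the YAML definition above into a reusable task factory
xgb_train_op = components.load_component_from_file('component.yaml')

@dsl.pipeline(name='xgboost-train', description='train an XGBoost classifier')
def train_pipeline(train_set, rounds: int = 30):
    # Input names from the YAML ("Training data", "Rounds") become
    # pythonic keyword arguments on the generated factory.
    xgb_train_op(training_data=train_set, rounds=rounds)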

You can also create a Pipeline in Python code, marking a function with a decorator. The parameters of that function become the parameters of the entire Pipeline. The steps inside the function are not executed directly; instead, a computation graph is built in which each step is a Component, and the graph is handed over to Kubernetes for distributed processing.

Created in code

Code 2. The structure of a Pipeline:

from kfp import dsl
from kubernetes.client.models import V1EnvVar, V1SecretKeySelector


@dsl.pipeline(
    name='foo',
    description='hello world')
def foo_pipeline(tag: str, pull_image_policy: str):

    # any attributes can be parameterized (both serialized string or actual PipelineParam)
    op = dsl.ContainerOp(
        name='foo',
        image='busybox:%s' % tag,
        # pass in init_container list
        init_containers=[dsl.InitContainer('print', 'busybox:latest', command='echo "hello"')],
        # pass in sidecars list
        sidecars=[dsl.Sidecar('print', 'busybox:latest', command='echo "hello"')],
        # pass in k8s container kwargs
        container_kwargs={'env': [V1EnvVar('foo', 'bar')]},
    )

    # set `imagePullPolicy` property for `container` with `PipelineParam`
    op.container.set_image_pull_policy(pull_image_policy)

    # add a sidecar with a parameterized image tag
    # the sidecar follows the Argo sidecar swagger spec
    op.add_sidecar(dsl.Sidecar('redis', 'redis:%s' % tag).set_image_pull_policy('Always'))
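
A hedged sketch of what typically happens next (the endpoint, output file name, and argument values below are assumptions): the decorated function is compiled into an Argo workflow package and submitted, after which it appears as a Run in the Kubeflow UI.

import kfp
from kfp import compiler

# Compile the decorated pipeline function into a package the Pipelines server can run
compiler.Compiler().compile(foo_pipeline, 'foo_pipeline.yaml')

# Submit the package; each submission shows up as a Run in the Kubeflow UI
client = kfp.Client(host='http://ml-pipeline-ui.kubeflow:80')   # hypothetical endpoint
client.create_run_from_pipeline_package(
    'foo_pipeline.yaml',
    arguments={'tag': 'latest', 'pull_image_policy': 'IfNotPresent'},
)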

Figure 3. Using the Kubeflow SDK in code

Created in Jupyter

Figure 4. Using Kubeflow in Notebook

Kubeflow architecture

Under the hood, the Argo Workflow Controller is what actually executes a Pipeline: it submits the computation tasks to Kubernetes.

Figure 5. Kubeflow Pipeline architecture

TensorFlow Training support (TFJob)

For TensorFlow tasks specifically, distributed computing resource management is implemented by the TFJob component, a Kubernetes CRD managed by tf-operator.

There are also corresponding components for PyTorch, MXNet, Chainer, and MPI tasks.
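
As a minimal sketch of what a TFJob looks like (assuming tf-operator and the kubeflow.org/v1 TFJob CRD are installed in the cluster; the job name, namespace, image, and replica count are placeholders), the custom resource can be created through the standard Kubernetes API:

from kubernetes import client, config

config.load_kube_config()

# A distributed training job with two workers; tf-operator creates and manages the pods
tfjob = {
    'apiVersion': 'kubeflow.org/v1',
    'kind': 'TFJob',
    'metadata': {'name': 'mnist-train', 'namespace': 'kubeflow'},
    'spec': {
        'tfReplicaSpecs': {
            'Worker': {
                'replicas': 2,
                'restartPolicy': 'OnFailure',
                'template': {
                    'spec': {
                        'containers': [{
                            'name': 'tensorflow',                    # required container name
                            'image': 'example/mnist-train:latest',   # placeholder image
                        }],
                    },
                },
            },
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group='kubeflow.org', version='v1', namespace='kubeflow',
    plural='tfjobs', body=tfjob,
)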

Integration with Jupyter Notebook

As shown above, Components and Pipelines can be created from YAML files and Python scripts, and they can also be created from a Notebook. This suits interactive development scenarios: you can dynamically deploy a Python function, continuously create and deploy new tasks, inspect the data, and verify the results of a computation.

The advantage of this approach is that users do not need to set up a development environment locally; everything is done in the browser, and access control can be applied. Notebooks can also be saved, so the entire environment can be shared with colleagues.
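
A hedged sketch of what a notebook cell might look like (it reuses foo_pipeline from Code 2; the argument values and experiment name are assumptions, and kfp.Client() without arguments assumes the notebook runs inside the cluster and can discover the Pipelines endpoint):

import kfp

# Inside an in-cluster notebook the client can usually discover the endpoint;
# pass host=... explicitly when connecting from outside the cluster.
client = kfp.Client()

client.create_run_from_pipeline_func(
    foo_pipeline,                          # the decorated function from Code 2
    arguments={'tag': 'latest', 'pull_image_policy': 'IfNotPresent'},
    experiment_name='notebook-experiments',
)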

KFServing

Kubeflow provides its own serving component. A machine learning model that needs to be deployed to production is exposed as a service and stays resident in memory, so the model does not have to be reloaded for every prediction. Under the hood, KFServing is built on Knative and Istio and implements serverless, elastically scaling inference services.
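
As a rough sketch of deploying a model with KFServing (shown against the serving.kubeflow.org/v1beta1 InferenceService resource; the name, namespace, and storageUri are placeholders, and the exact group/version depends on the KFServing release installed):

from kubernetes import client, config

config.load_kube_config()

# An InferenceService that serves a TensorFlow SavedModel from object storage
inference_service = {
    'apiVersion': 'serving.kubeflow.org/v1beta1',
    'kind': 'InferenceService',
    'metadata': {'name': 'mol-attr-pred', 'namespace': 'kubeflow'},
    'spec': {
        'predictor': {
            'tensorflow': {
                'storageUri': 's3://bucket/models/mol_attr_pred',   # placeholder model path
            },
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group='serving.kubeflow.org', version='v1beta1', namespace='kubeflow',
    plural='inferenceservices', body=inference_service,
)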

Figure 6. KFServing architecture diagram

As an example, here is how a molecular attribute prediction problem can be implemented with Kubeflow:

  1. Define the environment: package the dependencies of the molecular attribute prediction problem into a Docker image.
  2. Define the data: store the data that the problem relies on in a designated location, such as S3.
  3. Define the Components and start the Pipeline. The training Component definition is shown below, followed by a sketch of how it could be wired into a Pipeline.

name: mol_attr_pred - Train classifier
description: Trains a molecular attribute prediction model

inputs:
- {name: Training data}
- {name: Rounds, type: Integer, default: '30', help: Number of training rounds}

outputs:
- {name: Trained model, type: Tensorflow model, help: Trained Tensorflow model}

implementation:
  container:
    image: gcr.io/ml-pipeline/mol_attr_pred-classifier-train@sha256:b3a64d57
    command: [
      /ml/train.py,
      --train-set, {inputPath: Training data},
      --rounds,    {inputValue: Rounds},
      --out-model, {outputPath: Trained model},
    ]
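
A hedged sketch of step 3 (the data-preparation component, its output name prepared_data, the file names, and the S3 path are all hypothetical; only the training component corresponds to the YAML above):

from kfp import dsl, components

# Load component definitions; prepare_data.yaml is a hypothetical data-prep step
prepare_op = components.load_component_from_file('prepare_data.yaml')
train_op = components.load_component_from_file('mol_attr_pred.yaml')

@dsl.pipeline(name='mol-attr-pred', description='molecular attribute prediction')
def mol_pipeline(raw_data: str = 's3://bucket/molecules.csv', rounds: int = 30):
    prep = prepare_op(raw_data=raw_data)
    # The prepared data set feeds the "Training data" input of the training Component
    train_op(training_data=prep.outputs['prepared_data'], rounds=rounds)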

You can view the tasks in the dashboard:

Figure 7. Dashboard

Figure 8. Start a Pipeline

Kubeflow summary

  • Supports many machine learning frameworks, including TensorFlow, PyTorch, MXNet, TensorRT, ONNX, MPI, etc.

  • Highly integrated with Kubernetes and with cloud platforms such as Google Cloud, AWS, and Azure

  • The core is built around concepts such as Component and Pipeline

  • Provides task management, task history, task submission, computation scheduling, and other functions

  • Provides a Serving framework (KFServing)

  • Supports interactive creation of tasks in Jupyter Notebook

  • Consists of multiple loosely coupled, independent components

  • Deployment is complicated, operation and maintenance costs are high, and it is tightly coupled to the cloud platform

  • Complex APIs, many components, many concepts, and a tendency toward over-engineering

  • The documentation is messy (extremely confusing...)

Origin: blog.csdn.net/TQCAI666/article/details/114126665