Elastic Training Operator: an elastic deep learning training tool on Kubernetes

Background

Thanks to the natural advantages of cloud computing in resource cost and elastic scaling, more and more customers are choosing to build their AI systems on the cloud. Cloud-native technologies, represented by containers and Kubernetes, have become the shortest path to unlocking the value of the cloud, and building AI platforms on Kubernetes has become a trend.

When faced with more complex models or large amounts of data, the computing power of a single machine is often not enough. With distributed training frameworks such as Alibaba's AiACC or the open-source Horovod, only a few lines of code need to change to extend a single-machine training job into a distributed one. On Kubernetes, common examples are the Kubeflow community's tf-operator, which supports TensorFlow's PS mode, and the mpi-operator, which supports Horovod's MPI allreduce mode.
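As an illustration (not from the original article), the following is a minimal sketch of the "few lines" a framework like Horovod asks for to turn single-GPU TensorFlow 2 / Keras training into data-parallel training; the model and data here are placeholders.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                            # initialize Horovod

# pin each process to a single GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# placeholder model and data
model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation='softmax')])
x = tf.random.uniform([1024, 32])
y = tf.random.uniform([1024], maxval=10, dtype=tf.int64)

# wrap the optimizer so gradients are averaged across workers,
# and scale the learning rate by the number of workers
opt = hvd.DistributedOptimizer(tf.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

# broadcast initial weights from rank 0 so all workers start in sync
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x, y, batch_size=32, epochs=1,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)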

Status quo

Kubernetes and cloud computing provide agility and scalability. With components such as cluster-autoscaler we can set elastic policies for training jobs and use the elasticity of Kubernetes to create resources on demand, reducing GPU idle time.

However, this scaling model still falls short for offline tasks such as training:

  • Fault tolerance is not supported: when some workers fail because of hardware problems, the whole job has to be stopped and restarted.
  • Training jobs usually run for a long time and occupy a lot of computing power, yet they lack elasticity: when resources are tight, they cannot be released on demand for other workloads unless the job is terminated.
  • Because long-running jobs do not support dynamically changing the number of workers, they cannot safely use preemptible (spot) instances to maximize cost-effectiveness on the cloud.

Giving training jobs elasticity is therefore key to improving cost-effectiveness. Recently, distributed frameworks such as Horovod have gradually added support for Elastic Training: a training job can dynamically add or remove workers while it is running, without interrupting the job. This requires a small amount of code modification and adaptation; see https://horovod.readthedocs.io/en/stable/elastic_include.html .

If you are interested in how elastic training is implemented, you can read the Elastic Horovod design document; this article does not cover it in detail.
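To give a flavor of that adaptation, here is a minimal sketch loosely modeled on Horovod's elastic TensorFlow 2 example (the model, dataset, and step count are placeholders): the training loop is wrapped with hvd.elastic.run, and the state that must survive worker changes is kept in a Horovod elastic state object and committed periodically.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# placeholder model, optimizer and data pipeline
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.optimizers.Adam(0.001 * hvd.size())
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([1024, 32]),
     tf.random.uniform([1024], maxval=10, dtype=tf.int64))).batch(32).repeat()
data_iter = iter(dataset)
loss_fn = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def training_step(images, labels):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    tape = hvd.DistributedGradientTape(tape)   # allreduce gradients over the current workers
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@hvd.elastic.run                               # restart the loop when workers join or leave
def train(state):
    for state.batch in range(state.batch, 2000):
        training_step(*next(data_iter))
        if state.batch % 100 == 0:
            state.commit()                     # checkpoint state that survives membership changes

state = hvd.elastic.TensorFlowKerasState(model, optimizer, batch=0)
train(state)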

In mpi-operator, the workers participating in training are designed and maintained as static resources. Supporting the elastic training mode makes jobs more flexible, but it also brings challenges to the operations layer, for example:

  • The horovodrun command provided by Horovod must be used as the entry point. In Horovod, the launcher logs in to each worker over SSH, so the SSH tunnel between the launcher and the workers has to be opened.
  • Horovod's Elastic Driver, which computes the elastic topology, obtains the latest worker information by invoking a user-specified discover_host script and starts or stops worker instances accordingly. Whenever the set of workers changes, the output of the discover_host script must be updated first (see the sketch after this list).
  • In scenarios such as preemption or cost optimization, it is sometimes necessary to scale in specific workers. The native Kubernetes orchestration primitives, Deployment and StatefulSet, cannot remove a specified worker.
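For reference, Horovod's Elastic Driver expects the discovery script to print one "hostname:slots" line per available worker. et-operator generates this as a shell script, but any executable works; below is a hypothetical Python sketch of that contract, with made-up worker names.

#!/usr/bin/env python3
# Hypothetical stand-in for a discover_host script: print the current
# workers and their GPU slot counts, one "hostname:slots" line each.
CURRENT_WORKERS = {
    "elastic-training-worker-0": 1,   # made-up worker name -> GPU slots
    "elastic-training-worker-1": 1,
}

if __name__ == "__main__":
    for host, slots in CURRENT_WORKERS.items():
        print(f"{host}:{slots}")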

Solution

To address the above problems, we designed and developed et-operator. It provides a TrainingJob CRD to describe training jobs, plus ScaleOut and ScaleIn CRDs to describe scale-out and scale-in operations; combining them makes training jobs much more flexible. The project is open source, and we welcome requirements, discussions, and complaints.

Open source solution address: https://github.com/AliyunContainerService/et-operator

Design

The TrainingJob Controller mainly provides the following functions:

  • Maintain the creation/deletion lifecycle of a TrainingJob and manage its sub-resources.
  • Perform scale-out and scale-in operations.
  • Provide fault tolerance: when a worker is evicted, a new worker is created and added to the training.

1. Resource Creation

TrainingJob sub-resources are created in the following order:

  • Create the key pair needed for SSH access and store it in a Secret.
  • Create the workers, including their Services and Pods, and mount the Secret containing the public key.
  • Create a ConfigMap containing the discover_host script and the hostfile.
  • Create the launcher and mount the ConfigMap. Since the hostfile will later be modified as the worker topology changes, an initContainer copies it from the ConfigMap to a separate directory.

TrainingJob-related resources:

The TrainingJob CR configuration is divided into a Launcher part and a Worker part. The Launcher specifies the job image and the startup command. By default, et-operator generates a hostfile and a discover_host script based on the worker distribution; the discover_host script is mounted into the launcher at /etc/edl/discover_hosts.sh and passed to horovodrun in the entry script via the --host-discovery-script parameter. The Worker section specifies the worker image and GPU usage, and maxReplicas / minReplicas define the allowed range for the number of worker replicas.

apiVersion: kai.alibabacloud.com/v1alpha1
kind: TrainingJob
metadata:
  name: elastic-training
  namespace: default
spec:
  cleanPodPolicy: Running
  etReplicaSpecs:
    launcher:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - sh
            - -c
            - horovodrun -np 2 --min-np 1 --max-np 9 --host-discovery-script
              /etc/edl/discover_hosts.sh python /examples/elastic/tensorflow2_mnist_elastic.py
            image: registry.cn-huhehaote.aliyuncs.com/lumo/horovod:master-tf2.1.0-torch1.4.0-mxnet-py3.6-gpu
            imagePullPolicy: Always
            name: mnist-elastic
    worker:
      maxReplicas: 9
      minReplicas: 1
      replicas: 2
      template:
        spec:
          containers:
          - image: registry.cn-huhehaote.aliyuncs.com/lumo/horovod:master-tf2.1.0-torch1.4.0-mxnet-py3.6-gpu
            imagePullPolicy: Always
            name: mnist-elastic
            resources:
              limits:
                nvidia.com/gpu: "1"
              requests:
                nvidia.com/gpu: "1"
status:
  currentWorkers:
  - elastic-training-worker-0
  - elastic-training-worker-1
  - elastic-training-worker-2
  - elastic-training-worker-3
  phase: Succeeded
  replicaStatuses:
    Launcher:
      active: 1
      succeeded: 1
    Worker:
      active: 4

2. Worker scale-out/scale-in

In addition to TrainingJob, et-operator supports ScaleOut and ScaleIn CRDs, which are used to issue scale-out and scale-in operations for training jobs.

When a ScaleOut CR is submitted, the ScaleOutController triggers a Reconcile. The work here is very simple: based on the Selector field in the ScaleOut CR, it finds the TrainingJob targeted by the scaler and sets it as the OwnerReference of the CR.

Take a ScaleOut operation as an example:

- apiVersion: kai.alibabacloud.com/v1alpha1
  kind: ScaleOut
  metadata:
    creationTimestamp: "2020-11-04T13:54:26Z"
    name: scaleout-ptfnk
    namespace: default
    ownerReferences:
    - apiVersion: kai.alibabacloud.com/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: TrainingJob
      name: elastic-training    # points to the TrainingJob being scaled out
      uid: 075b9c4a-22f9-40ce-83c7-656b329a2b9e
  spec:
    selector:
      name: elastic-training
    toAdd:
      count: 2

When the TrainingJobController observes that a ScaleOut CR belonging to a TrainingJob has been updated, it triggers a Reconcile of that TrainingJob, traverses and filters the ScaleIn and ScaleOut CRs whose OwnerReference points to the TrainingJob, and decides which scaling operation to perform based on their creation time and status time (a sketch of this selection logic follows the example below).

apiVersion: kai.alibabacloud.com/v1alpha1
kind: TrainingJob
metadata:
  name: elastic-training
  namespace: default
spec: 
  # ...... Launcher and Worker spec
status:
  currentScaler: ScaleOut:default/scaleout-ptfnk
  phase: Scaling
  currentWorkers:
  - elastic-training-worker-0
  - elastic-training-worker-1
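The exact selection rule lives in the controller code; as one plausible reading of the behavior described above (an assumption for illustration, not the operator's actual implementation), the controller could simply apply the most recently created scaler that has not finished yet:

from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class Scaler:
    """Simplified stand-in for a ScaleIn/ScaleOut CR owned by a TrainingJob."""
    name: str
    kind: str                 # "ScaleIn" or "ScaleOut"
    creation_time: datetime
    finished: bool            # e.g. phase is ScaleSucceeded or ScaleFailed

def pick_current_scaler(scalers: List[Scaler]) -> Optional[Scaler]:
    # Ignore scalers that already finished, then take the newest one.
    pending = [s for s in scalers if not s.finished]
    return max(pending, key=lambda s: s.creation_time, default=None)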

ScaleOut task CR:

ScaleIn task CR:

Detailed workflow:

Run

1. Install ET-Operator

mkdir -p $(go env GOPATH)/src/github.com/AliyunContainerService
cd $(go env GOPATH)/src/github.com/AliyunContainerService
git clone https://github.com/AliyunContainerService/et-operator
cd et-operator
kubectl create -f deploy/all_in_one.yaml 

Check that the CRDs are installed:

# kubectl get crd
NAME                                    CREATED AT
scaleins.kai.alibabacloud.com           2020-11-11T11:16:13Z
scaleouts.kai.alibabacloud.com          2020-11-11T11:16:13Z
trainingjobs.kai.alibabacloud.com       2020-11-11T11:16:13Z

Check the running status of the controller, which is installed in the kube-ai namespace by default:

# kubectl -n kube-ai get po
NAME                                         READY   STATUS              RESTARTS   AGE
et-operator-controller-manager-7877968489-c5kv4   0/2     ContainerCreating   0          5s

2. Run TrainingJob

Run the prepared example:

kubectl apply -f examples/training_job.yaml

Check running status:

# kubectl get trainingjob
NAME                          PHASE     AGE
elastic-training              Running   77s

# kubectl get po
NAME                                      READY   STATUS             RESTARTS   AGE
elastic-training-launcher                 1/1     Running            0          7s
elastic-training-worker-0                 1/1     Running            0          10s
elastic-training-worker-1                 1/1     Running            0          9s

3. Scale in training job workers

When scaling in, you can specify which workers to remove through the spec.toDelete.count or spec.toDelete.podNames field of the ScaleIn CR.

If count is configured, the workers to remove are selected by index from highest to lowest, as in the sketch below.
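For illustration, a hypothetical helper showing the index-based selection described above (highest-indexed workers are removed first):

def workers_to_remove(workers, count):
    """Pick `count` workers to scale in, highest pod index first."""
    by_index = sorted(workers, key=lambda name: int(name.rsplit("-", 1)[1]), reverse=True)
    return by_index[:count]

# e.g. workers_to_remove(["elastic-training-worker-0",
#                         "elastic-training-worker-1"], 1)
# -> ["elastic-training-worker-1"]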

apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleIn
metadata:
  name: scalein-workers
spec:
  selector:
    name: elastic-training
  toDelete:
    count: 1

If you want to remove specific workers, configure podNames:

apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleIn
metadata:
  name: scalein-workers
spec:
  selector:
    name: elastic-training
  toDelete:
    podNames:
    - elastic-training-worker-1

Run a scale-in example that removes one worker by count:

kubectl create -f examples/scale_in_count.yaml

Check the status of the scale-in operation and the training job:

# kubectl get scalein
NAME                                     PHASE            AGE
scalein-sample-t8jxd                     ScaleSucceeded   11s

# kubectl get po
NAME                                      READY   STATUS             RESTARTS   AGE
elastic-training-launcher                 1/1     Running            0          47s
elastic-training-worker-0                 1/1     Running            0          50s

4. Scale out training jobs

In the ScaleOut CR, specify the number of workers to add through the spec.toAdd.count field:

apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleOut
metadata:
  name: elastic-training-scaleout-9dtmw
  namespace: default
spec:
  selector:
    name: elastic-training
  timeout: 300
  toAdd:
    count: 2

Run the example:

kubectl create -f examples/scale_out.yaml

Check the status of the scale-out operation and the training job:

# kubectl get scaleout
NAME                                     PHASE            AGE
elastic-training-scaleout-9dtmw          ScaleSucceeded   30s
# kubectl get po
NAME                                      READY   STATUS             RESTARTS   AGE
elastic-training-launcher                 1/1     Running            0          2m5s
elastic-training-worker-0                 1/1     Running            0          2m8s
elastic-training-worker-1                 1/1     Running            0          40s
elastic-training-worker-2                 1/1     Running            0          40s

Summary

et-operator provides a set of training and scaling CRDs and controllers that make it easy to run elastic distributed training on Kubernetes. It supports issuing distributed training jobs and, through integration with the distributed framework, dynamically scales the workers participating in the computation out and in while the job is running. Making training jobs elastic in this way, combined with preemptible instances, lets us better exploit the resource elasticity and cost advantages of the cloud.

Author | Xu Xiaozhou (Xiao Yuan)

Original link

This article is the original content of Alibaba Cloud and may not be reproduced without permission.
