Background
Because of cloud computing's natural advantages in resource cost and elastic scaling, more and more customers are willing to build AI systems on the cloud, and cloud-native technologies represented by containers and Kubernetes have become the shortest path to unlocking the value of the cloud. Building AI platforms on Kubernetes has become a trend.
When facing more complex models or large amounts of data, the computing power of a single machine often cannot meet the requirements. With distributed training frameworks such as Alibaba's AiACC or the community's Horovod, modifying only a few lines of code can turn a single-machine training job into a distributed one. On Kubernetes, it is common to use the Kubeflow community's tf-operator to run TensorFlow in PS mode, or mpi-operator to run Horovod's MPI allreduce mode.
Status Quo
Kubernetes and cloud computing provide agility and scalability. With components such as Cluster Autoscaler, we can set elastic policies for training jobs and use Kubernetes' on-demand resource creation to reduce GPU idle time.
However, this scaling mode still falls short for offline workloads such as training:
- Fault tolerance is not supported. When some workers fail due to hardware problems, the entire job must be stopped and restarted.
- Training jobs usually run for a long time, occupy a large amount of compute, and lack elasticity. When resources are insufficient, they cannot yield resources to other workloads on demand unless the job is terminated.
- Because training jobs run for a long time and workers cannot be reconfigured dynamically, jobs cannot safely use preemptible (spot) instances to maximize cost-effectiveness on the cloud.
Giving elasticity to training jobs is the key to improving cost-effectiveness. Recently, distributed frameworks such as Horovod have gradually added support for Elastic Training: a training job is allowed to dynamically scale its workers out or in during execution without interrupting the job. This requires a small amount of code modification and adaptation; see https://horovod.readthedocs.io/en/stable/elastic_include.html .
If you are interested in how Elastic Training is implemented, read the Elastic Horovod design document; this article does not cover it in detail.
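The commit-and-restore idea at the heart of elastic training can be illustrated with a small, self-contained Python sketch. Note that this is not the Horovod API: the names WorkersChanged, elastic_run, and make_step are invented for illustration only; Horovod's real mechanism is described in its documentation.

```python
class WorkersChanged(Exception):
    """Raised when the worker topology changes mid-step."""

def make_step(fail_at):
    """One 'training step' that simulates a preemption the first
    time step `fail_at` is attempted."""
    failed = {"done": False}
    def step(n):
        if n == fail_at and not failed["done"]:
            failed["done"] = True
            raise WorkersChanged
        return n + 1
    return step

def elastic_run(step_fn, total_steps):
    """Run steps, committing progress after each success and rolling
    back to the last commit when the worker set changes."""
    n = committed = 0
    while n < total_steps:
        try:
            n = step_fn(n)
            committed = n          # commit after a successful step
        except WorkersChanged:
            n = committed          # roll back and resume, don't abort
    return n

print(elastic_run(make_step(fail_at=3), 5))  # -> 5
```

The point is that a topology change costs only the steps since the last commit, instead of killing the whole job.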
In mpi-operator, the workers participating in training are designed and maintained as static resources. Supporting the elastic training mode adds flexibility to jobs but also brings challenges at the operations layer, such as:
- You must use the horovodrun provided by Horovod as the entry point. In Horovod, the launcher logs in to each worker via SSH, so the SSH tunnels between the launcher and the workers must be opened.
- The Elastic Driver module responsible for elasticity obtains the latest worker topology by invoking a user-specified discover_host script, and starts or stops worker instances accordingly. When workers change, the output of the discover_host script must be updated first.
- In scenarios such as preemption or cost-based scheduling, it is sometimes necessary to scale in specific workers. Kubernetes' native orchestration primitives, Deployment and StatefulSet, cannot satisfy this targeted scale-in scenario.
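To make the discover_host mechanism concrete, here is a minimal Python sketch (all names hypothetical, not the Horovod driver's actual code) of how an elastic driver could parse the script's "hostname:slots" output and diff consecutive polls to decide which workers were added or removed:

```python
def parse_hosts(script_output):
    """Parse 'hostname:slots' lines as a discover_hosts script prints them."""
    hosts = {}
    for line in script_output.strip().splitlines():
        name, slots = line.rsplit(":", 1)
        hosts[name] = int(slots)
    return hosts

def diff_topology(old, new):
    """Return (added, removed) worker names between two polls."""
    return sorted(set(new) - set(old)), sorted(set(old) - set(new))

before = parse_hosts("elastic-training-worker-0:1\nelastic-training-worker-1:1")
after = parse_hosts("elastic-training-worker-0:1\nelastic-training-worker-2:1")
print(diff_topology(before, after))
# -> (['elastic-training-worker-2'], ['elastic-training-worker-1'])
```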
Solution
To address the above problems, we designed and developed et-operator. It provides a TrainingJob CRD to describe training jobs, plus ScaleOut and ScaleIn CRDs to describe scale-out and scale-in operations; combining them makes our training jobs much more elastic. We have open sourced this solution; requirements, discussions, and complaints are all welcome.
Open source solution address: https://github.com/AliyunContainerService/et-operator
Design
The TrainingJob controller has three main functions:
- Maintain the creation/deletion lifecycle of TrainingJob and manage its sub-resources.
- Perform scale-out and scale-in operations.
- Fault tolerance: when a worker is evicted, create a new worker and add it to the training.
1. Resource Creation
The sequence of creating TrainingJob sub-resources is as follows:
- Create the key pair needed for SSH access and store it in a Secret.
- Create the workers, including a Service and a Pod for each, and mount the Secret's public key.
- Create a ConfigMap containing the discover_host script and the hostfile.
- Create the launcher and mount the ConfigMap. Because the hostfile will later be modified as the topology changes, an init container copies the hostfile from the ConfigMap to a separate directory.
TrainingJob related resources:
The TrainingJob CR configuration is divided into Launcher and Worker. In the launcher, specify the job image and start command; by default, et-operator generates a hostfile and a discover_host script based on the worker distribution. The discover_host script is mounted into the launcher at /etc/edl/discover_hosts.sh and passed to horovodrun through the --host-discovery-script parameter in the entry script. In the worker settings, specify the worker image and GPU usage, and set the allowed range of worker replicas through maxReplicas / minReplicas.
apiVersion: kai.alibabacloud.com/v1alpha1
kind: TrainingJob
metadata:
  name: elastic-training
  namespace: default
spec:
  cleanPodPolicy: Running
  etReplicaSpecs:
    launcher:
      replicas: 1
      template:
        spec:
          containers:
            - command:
                - sh
                - -c
                - horovodrun -np 2 --min-np 1 --max-np 9 --host-discovery-script /etc/edl/discover_hosts.sh python /examples/elastic/tensorflow2_mnist_elastic.py
              image: registry.cn-huhehaote.aliyuncs.com/lumo/horovod:master-tf2.1.0-torch1.4.0-mxnet-py3.6-gpu
              imagePullPolicy: Always
              name: mnist-elastic
    worker:
      maxReplicas: 9
      minReplicas: 1
      replicas: 2
      template:
        spec:
          containers:
            - image: registry.cn-huhehaote.aliyuncs.com/lumo/horovod:master-tf2.1.0-torch1.4.0-mxnet-py3.6-gpu
              imagePullPolicy: Always
              name: mnist-elastic
              resources:
                limits:
                  nvidia.com/gpu: "1"
                requests:
                  nvidia.com/gpu: "1"
status:
  currentWorkers:
    - elastic-training-worker-0
    - elastic-training-worker-1
    - elastic-training-worker-2
    - elastic-training-worker-3
  phase: Succeeded
  replicaStatuses:
    Launcher:
      active: 1
      succeeded: 1
    Worker:
      active: 4
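Based on a currentWorkers list like the one in the status above, the generated files plausibly look like the following. This is an illustrative sketch, not et-operator's actual generation code; the "hostname:slots" and "hostname slots=N" formats are the ones Horovod host-discovery scripts and MPI hostfiles conventionally use.

```python
def render_discover_hosts(workers, slots=1):
    """Render a discover_hosts.sh that prints one 'host:slots' line per worker."""
    lines = ["#!/bin/sh"] + [f"echo {w}:{slots}" for w in workers]
    return "\n".join(lines) + "\n"

def render_hostfile(workers, slots=1):
    """Render an MPI-style hostfile ('host slots=N' per line)."""
    return "\n".join(f"{w} slots={slots}" for w in workers) + "\n"

workers = ["elastic-training-worker-0", "elastic-training-worker-1"]
print(render_discover_hosts(workers), end="")
```

When the worker set changes, regenerating these two artifacts is all that is needed for horovodrun to observe the new topology on its next poll.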
2. Worker Scale-Out and Scale-In
In addition to TrainingJob, et-operator supports ScaleOut and ScaleIn CRDs for issuing scale-out and scale-in operations against training jobs.
When a ScaleOut CR is issued, the ScaleOutController triggers a Reconcile. The work here is simple: based on the Selector field in the ScaleOut CR, it finds the TrainingJob the scaler targets and sets that TrainingJob in the CR's OwnerReferences.
Take a ScaleOut operation as an example:
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleOut
metadata:
  creationTimestamp: "2020-11-04T13:54:26Z"
  name: scaleout-ptfnk
  namespace: default
  ownerReferences:
    - apiVersion: kai.alibabacloud.com/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: TrainingJob
      name: elastic-training   # points to the TrainingJob being scaled out
      uid: 075b9c4a-22f9-40ce-83c7-656b329a2b9e
spec:
  selector:
    name: elastic-training
  toAdd:
    count: 2
The TrainingJobController watches for updates to ScaleOut CRs belonging to a TrainingJob and triggers the TrainingJob's Reconcile. It traverses and filters the ScaleIn and ScaleOut CRs whose OwnerReference points to the TrainingJob, then decides which scaling operation to execute based on their creation time and status time.
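A hedged sketch of that selection step, assuming the controller runs the earliest pending scaler first (the field names here are illustrative, and et-operator's exact ordering rule may differ):

```python
from dataclasses import dataclass

@dataclass
class Scaler:
    name: str
    kind: str          # "ScaleIn" or "ScaleOut"
    created_at: float  # creation timestamp
    finished: bool     # already executed?

def pick_current_scaler(scalers):
    """Pick the earliest pending scaler, or None if all are done."""
    pending = [s for s in scalers if not s.finished]
    return min(pending, key=lambda s: s.created_at) if pending else None

scalers = [
    Scaler("scaleout-ptfnk", "ScaleOut", 100.0, finished=True),
    Scaler("scalein-xq5fz", "ScaleIn", 200.0, finished=False),
    Scaler("scaleout-abcde", "ScaleOut", 300.0, finished=False),
]
print(pick_current_scaler(scalers).name)  # -> scalein-xq5fz
```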
apiVersion: kai.alibabacloud.com/v1alpha1
kind: TrainingJob
metadata:
  name: elastic-training
  namespace: default
spec:
  # ...... Launcher and Worker spec
status:
  currentScaler: ScaleIn:default/scaleout-ptfnk
  phase: Scaling
  currentWorkers:
    - elastic-training-worker-0
    - elastic-training-worker-1
ScaleOut job CR:
ScaleIn job CR:
Detailed workflow:
Run
1. Install ET-Operator
mkdir -p $(go env GOPATH)/src/github.com/AliyunContainerService
cd $(go env GOPATH)/src/github.com/AliyunContainerService
git clone https://github.com/AliyunContainerService/et-operator
cd et-operator
kubectl create -f deploy/all_in_one.yaml
Check that the CRDs are installed:
# kubectl get crd
NAME                                CREATED AT
scaleins.kai.alibabacloud.com       2020-11-11T11:16:13Z
scaleouts.kai.alibabacloud.com      2020-11-11T11:16:13Z
trainingjobs.kai.alibabacloud.com   2020-11-11T11:16:13Z
Check the controller's running status; it is installed in the kube-ai namespace by default:
# kubectl -n kube-ai get po
NAME                                              READY   STATUS              RESTARTS   AGE
et-operator-controller-manager-7877968489-c5kv4   0/2     ContainerCreating   0          5s
2. Run TrainingJob
Run the prepared example:
kubectl apply -f examples/training_job.yaml
Check running status:
# kubectl get trainingjob
NAME               PHASE     AGE
elastic-training   Running   77s

# kubectl get po
NAME                        READY   STATUS    RESTARTS   AGE
elastic-training-launcher   1/1     Running   0          7s
elastic-training-worker-0   1/1     Running   0          10s
elastic-training-worker-1   1/1     Running   0          9s
3. Scale In a Training Job's Workers
When scaling in, you can specify the workers to remove through the spec.toDelete.count or spec.toDelete.podNames field of the ScaleIn CR.
When count is configured, the workers to remove are computed by index from highest to lowest.
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleIn
metadata:
  name: scalein-workers
spec:
  selector:
    name: elastic-training
  toDelete:
    count: 1
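The index-based selection described above can be sketched as follows (an illustration of the described behaviour, not et-operator's actual code):

```python
def workers_to_delete(workers, count):
    """Pick `count` workers to remove, highest index first."""
    by_index_desc = sorted(workers,
                           key=lambda w: int(w.rsplit("-", 1)[1]),
                           reverse=True)
    return by_index_desc[:count]

workers = [f"elastic-training-worker-{i}" for i in range(4)]
print(workers_to_delete(workers, 1))  # -> ['elastic-training-worker-3']
```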
To scale in specific workers, configure podNames instead:
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleIn
metadata:
  name: scalein-workers
spec:
  selector:
    name: elastic-training
  toDelete:
    podNames:
      - elastic-training-worker-1
Run the scale-in example, which removes one worker:
kubectl create -f examples/scale_in_count.yaml
Check the status of the scale-in operation and the training job:
# kubectl get scalein
NAME                   PHASE            AGE
scalein-sample-t8jxd   ScaleSucceeded   11s

# kubectl get po
NAME                        READY   STATUS    RESTARTS   AGE
elastic-training-launcher   1/1     Running   0          47s
elastic-training-worker-0   1/1     Running   0          50s
4. Scale Out a Training Job's Workers
In the ScaleOut CR, specify the number of workers to add through the spec.toAdd.count field:
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleOut
metadata:
  name: elastic-training-scaleout-9dtmw
  namespace: default
spec:
  selector:
    name: elastic-training
  timeout: 300
  toAdd:
    count: 2
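Assuming new workers continue numbering from the highest existing index (an assumption consistent with the pod names shown in this article, not a confirmed implementation detail), count-based scale-out can be sketched as:

```python
def workers_to_add(existing, count, prefix="elastic-training-worker-"):
    """Name `count` new workers, continuing from the highest existing index."""
    indices = [int(w.rsplit("-", 1)[1]) for w in existing]
    start = max(indices) + 1 if indices else 0
    return [f"{prefix}{i}" for i in range(start, start + count)]

print(workers_to_add(["elastic-training-worker-0"], 2))
# -> ['elastic-training-worker-1', 'elastic-training-worker-2']
```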
Run example:
kubectl create -f examples/scale_out.yaml
Check the status of the scale-out operation and the training job:
# kubectl get scaleout
NAME                              PHASE            AGE
elastic-training-scaleout-9dtmw   ScaleSucceeded   30s

# kubectl get po
NAME                        READY   STATUS    RESTARTS   AGE
elastic-training-launcher   1/1     Running   0          2m5s
elastic-training-worker-0   1/1     Running   0          2m8s
elastic-training-worker-1   1/1     Running   0          40s
elastic-training-worker-2   1/1     Running   0          40s
Summary
et-operator provides a set of training and scaling CRDs and controllers that make it easy to run elastic distributed training on Kubernetes. It supports dispatching distributed training jobs and, through integration with the distributed framework, dynamically scales the workers participating in computation in and out while a job is running. Making training jobs elastic, combined with preemptible instances, lets us take better advantage of the resource elasticity and cost-effectiveness of the cloud.
Author | Xu Xiaozhou (Xiao Yuan)
This article is the original content of Alibaba Cloud and may not be reproduced without permission.