Architecture and practice of TensorFlow on Kubernetes

Author: [email protected]

Veterans who have played with containers for a while know that Kubernetes has been extremely popular over the past two years. To date it has 31K+ stars on GitHub. Compared to TensorFlow, though, that can only be called average: TensorFlow has been around for just over two years and already has 86K+ stars on GitHub. To put that in perspective, the Linux kernel has accumulated only 54K+ stars over all these years. Of course, each is the dominant player in its own field, and this comparison is only small talk.

In the past two years, Kubernetes has delivered excellent results for DevOps and microservices at many enterprises. Since 2017, more and more enterprises have begun exploring Kubernetes for HPC, AI, and other fields. With the rapid growth of the company's AI business, vivo also began, in September 2017, to explore deep integration with ML frameworks such as TensorFlow, building on the powerful distributed capabilities of Kubernetes, in order to improve data center resource utilization and speed up algorithm iteration.

The biggest difference between applying Kubernetes to AI and deploying apps in DevOps lies in the scale and the life cycle of the containers. In our practice, even though the cluster is still fairly small, nearly 100,000 containers need to be scheduled every day, and many containers run for only around ten minutes or even just a few minutes; we plan to grow the cluster tenfold in 2018. In a DevOps scenario, no matter how frequently applications are released, I believe few companies schedule 100,000 containers within a whole year. Below I will talk about the architecture of TensorFlow on Kubernetes and our practice at vivo.

Distributed TensorFlow

TensorFlow is an open-source software library for numerical computation using dataflow graphs. Nodes in the graph represent mathematical operations, while edges represent the multidimensional arrays (tensors) passed between them. This flexible architecture lets you deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. I won't go further into TensorFlow's basic concepts here.

Stand-alone TensorFlow

The following is a schematic diagram of stand-alone TensorFlow training: the client submits a session that specifies which CPUs/GPUs this worker should use and what work it should do.

[Figure: stand-alone TensorFlow training]
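To make the picture concrete, here is a minimal stand-alone sketch using the TensorFlow 1.x-style API; the device string and the toy computation are illustrative assumptions, not code from our platform.

import tensorflow as tf

# Minimal stand-alone sketch (TF 1.x-style API): the client builds a graph,
# pins the ops to a device, and runs them through a local Session.
with tf.device("/cpu:0"):             # "/gpu:0" would work if a GPU is available
    a = tf.constant([1.0, 2.0])
    b = tf.constant([3.0, 4.0])
    c = a * b                         # the "work" this worker is asked to do

with tf.Session() as sess:            # the client submits the graph via a session
    print(sess.run(c))                # -> [3. 8.]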

Distributed TensorFlow

In April 2016, TensorFlow released version 0.8, announcing support for distributed computing; we call this Distributed TensorFlow. This is a very important feature, because in the world of AI the amount of training data and the number of model parameters can be enormous. For example, the paper published by the Google Brain team this year, "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer", mentions a model with 68 billion parameters. If it could only be trained on a single machine, the training time would be unacceptably long. With Distributed TensorFlow, a large number of servers can be used to build a distributed TensorFlow cluster, improving training efficiency and reducing training time.

Through TensorFlow's replication mechanism, users can distribute a SubGraph to different servers for distributed computing. The replication mechanism comes in two flavors: In-graph and Between-graph.

Simply put, with In-graph Replication a single client session defines the work of all tasks in the TensorFlow cluster.

[Figure: In-graph Replication]

In contrast, with Between-graph Replication each worker has its own independent client that defines its own work.

[Figure: Between-graph Replication]

The distributed TensorFlow framework can be abstracted as follows:

[Figure: abstracted distributed TensorFlow framework]

Let's first go over a few of the concepts involved:

  • Cluster

A TensorFlow Cluster consists of one or more jobs, and each job consists of one or more tasks. A Cluster is defined by tf.train.ClusterSpec. For example, the ClusterSpec defining a TensorFlow Cluster with 3 workers and 2 ps looks like this:

tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",  //主机名也可以使用IP
        "worker1.example.com:2222",
        "worker2.example.com:2222"
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222"
    ]})
  • Client

A Client builds a TensorFlow Graph and creates a tensorflow::Session to communicate with the cluster. One Client can interact with multiple TensorFlow Servers, and one Server can serve multiple Clients.

  • Job

A job consists of a list of tasks, and jobs come in two types: ps and worker. ps is the parameter server, used to store and update variables, while a worker can be considered stateless and is used for computation. Among the workers, a chief worker (usually worker0) is generally designated to checkpoint the training state, so that if a worker fails, training can be restored from the latest checkpoint.

  • Task

Each Task corresponds to one TensorFlow Server, i.e. a separate process. A Task belongs to a particular Job, and its position in that Job's task list is marked by an index. Each TensorFlow Server implements a Master service and a Worker service: the Master service handles gRPC interaction with the Worker services in the cluster, while the Worker service computes subgraphs using the local devices. (A minimal code sketch tying these concepts together follows.)
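To make the concepts concrete, here is a minimal between-graph sketch of what each task might run; the hostnames reuse the ClusterSpec above, while the job_name/task_index handling, the toy variable, and the training op are placeholder assumptions for illustration, not production code.

import tensorflow as tf

# Minimal between-graph sketch: every task starts a tf.train.Server for its
# job/index, ps tasks just serve variables, and each worker builds its own graph.
# The ClusterSpec reuses the hosts shown above; the "model" is a placeholder.
cluster = tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222",
               "worker2.example.com:2222"],
    "ps": ["ps0.example.com:2222",
           "ps1.example.com:2222"]})

job_name, task_index = "worker", 0             # passed in per task in practice
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                              # ps tasks only store/update variables
else:
    # variables land on the ps tasks, computation stays on this worker
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        weight = tf.Variable(0.0, name="weight")   # placeholder "model"
        train_op = tf.assign_add(weight, 1.0)
    with tf.Session(server.target) as sess:    # the client talks to its own server
        sess.run(tf.global_variables_initializer())
        print(sess.run(train_op))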

For more information about Distributed TensorFlow, please refer to the official documentation: www.tensorflow.org/deploy/distributed

The flaws of distributed TensorFlow

Distributed TensorFlow can use the resource pool formed by all the servers in the data center, so that large numbers of PS and worker tasks can be distributed across different servers for parameter storage and training. This is undoubtedly the key to putting TensorFlow into production in an enterprise. However, it is not enough; distributed TensorFlow still has inherent shortcomings:

  • During training, the resources of TensorFlow tasks cannot be isolated, so tasks are likely to interfere with each other through resource contention.
  • There is no scheduling capability; users must manually configure and manage the computing resources for each task.
  • When the cluster is large, managing training tasks becomes very troublesome; tracking and managing the status of every task requires a lot of development work in the upper layer.
  • To view the training logs of a task, users have to find the corresponding server and ssh into it, which is very inconvenient.
  • The backend file systems natively supported by TensorFlow are limited to standard POSIX file systems (such as NFS), HDFS, GCS, and memory-mapped files. Most enterprise data lives on a big data platform, so HDFS is the main option; however, HDFS read performance is not great. (A minimal HDFS read sketch follows this list.)
  • Creating a large-scale TensorFlow cluster is not easy.
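As an illustration of the HDFS support (and of its dependence on the Hadoop CLASSPATH, which the Job templates below also set up), here is a minimal read sketch; the namenode address and paths are made-up placeholders, not values from our environment.

import tensorflow as tf

# Minimal sketch of reading training data directly from HDFS. It assumes the
# CLASSPATH is set from `hadoop classpath --glob` (as the Job command below does);
# the namenode address and paths are placeholders.
files = tf.gfile.Glob("hdfs://namenode.example.com:8020/data/train/*.tfrecord")
dataset = tf.data.TFRecordDataset(files)       # records are streamed from HDFS
record = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    print(sess.run(record))                    # one serialized tf.Example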

TensorFlow on Kubernetes Architecture and Principles

These shortcomings of distributed TensorFlow are exactly where Kubernetes is strong:

  • It provides resource management mechanisms such as ResourceQuota and LimitRanger, achieving good resource isolation between tasks.
  • It handles the configuration and scheduling of the computing resources that tasks need.
  • Training tasks run as containers, and Kubernetes provides a full set of container lifecycle (PLEG) interfaces, so managing task status is very convenient.
  • It connects easily to log solutions such as EFK/ELK, so users can easily view task logs.
  • It supports distributed storage with better read performance (GlusterFS); we have not connected GlusterFS yet, it is planned, but we lack the manpower.
  • Large-scale TensorFlow clusters can be created quickly and easily through declarative files.

TensorFlow on Kubernetes Architecture

[Figure: TensorFlow on Kubernetes architecture]

Principles of TensorFlow on Kubernetes

In our TensorFlow on Kubernetes solution, the following Kubernetes objects are mainly used:

  • Kubernetes Job

We use a Kubernetes Job to deploy each TensorFlow worker: after the worker finishes training and exits normally, the container is not restarted. Note that restartPolicy in a Job's Pod template can only be Never or OnFailure, not Always. We set restartPolicy to OnFailure, so that once a worker exits abnormally it is automatically restarted. However, you must make sure training can be restored from a checkpoint after the worker restarts; otherwise the worker starts again from step 0, and several days of training may be wasted. If your algorithm is written against the TensorFlow high-level APIs this is handled by default, but if you use the low-level core APIs you must implement it yourself. (A checkpoint-restore sketch follows the Job template below.)

kind: Job
apiVersion: batch/v1
metadata:
  name: {{ name }}-{{ task_type }}-{{ i }}
  namespace: {{ name }}
spec:
  template:
    metadata:
      labels:
        name: {{ name }}
        job: {{ task_type }}
        task: "{{ i }}"
    spec:
      imagePullSecrets:
      - name: harborsecret
      containers:
      - name: {{ name }}-{{ task_type }}-{{ i }}
        image: {{ image }}
        resources:
          requests:
            memory: "4Gi"
            cpu: "500m"
        ports:
        - containerPort: 2222
        command: ["/bin/sh", "-c", "export CLASSPATH=.:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob); wget -r -nH  -np --cut-dir=1 -R 'index.html*,*gif'  {{ script }}; cd ./{{ name }}; sh ./run.sh {{ ps_hosts() }} {{ worker_hosts() }} {{ task_type }} {{ i }} {{ ps_replicas }} {{ worker_replicas }}"]
      restartPolicy: OnFailure
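Regarding the checkpoint requirement mentioned above, here is a minimal sketch of how a worker written with tf.train.MonitoredTrainingSession resumes from its checkpoint directory after a restart; the HDFS checkpoint path and the toy train_op are assumptions for illustration only.

import tensorflow as tf

# Minimal sketch: MonitoredTrainingSession restores from checkpoint_dir on
# (re)start and saves periodically, so a restarted worker does not go back to
# step 0. The checkpoint path and the toy train_op are illustrative placeholders.
global_step = tf.train.get_or_create_global_step()
train_op = tf.assign_add(global_step, 1)       # stands in for a real training step

with tf.train.MonitoredTrainingSession(
        master="",                             # server.target in a real cluster
        is_chief=True,                         # only the chief saves checkpoints
        checkpoint_dir="hdfs://namenode.example.com:8020/ckpt/my-job",
        save_checkpoint_secs=60) as sess:
    while not sess.should_stop():
        step = sess.run(train_op)
        if step >= 1000:                       # stop condition for the sketch
            break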
  • Kubernetes Deployment

TensorFlow PS is deployed with a Kubernetes Deployment. Why not use a Job, as with the workers? That would actually work too, but since the PS process does not exit on its own when all workers finish training (it just keeps hanging around), deploying it as a Job would be meaningless.

kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: {{ name }}-{{ task_type }}-{{ i }}
  namespace: {{ name }}
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: {{ name }}
        job: {{ task_type }}
        task: "{{ i }}"
    spec:
      imagePullSecrets:
      - name: harborsecret
      containers:
      - name: {{ name }}-{{ task_type }}-{{ i }}
        image: {{ image }}
        resources:
          requests:
            memory: "4Gi"
            cpu: "500m"
        ports:
        - containerPort: 2222
        command: ["/bin/sh", "-c","export CLASSPATH=.:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob); wget -r -nH  -np --cut-dir=1 -R 'index.html*,*gif'  {{ script }}; cd ./{{ name }}; sh ./run.sh {{ ps_hosts() }} {{ worker_hosts() }} {{ task_type }} {{ i }} {{ ps_replicas }} {{ worker_replicas }}"]
      restartPolicy: Always

For the problem of the TensorFlow PS process never exiting, see https://github.com/tensorflow/tensorflow/issues/4713. We solved it by developing a module that watches the status of all workers in each TensorFlow cluster: when the workers' Jobs are Completed, it automatically deletes the corresponding PS Deployments, thereby killing the PS processes and releasing the resources. (A minimal sketch of such a watcher follows.)
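As an illustration only (not our production module), here is a minimal sketch of that idea using the official kubernetes Python client; the namespace, the label selectors, and the one-namespace-per-training-cluster assumption follow the templates above, but the logic is simplified.

from kubernetes import client, config, watch

# Minimal sketch: watch the worker Jobs in one training namespace and delete the
# PS Deployments once every worker has completed. Illustrative only; the labels
# and namespace follow the templates above, and the logic is simplified.
config.load_kube_config()                       # or config.load_incluster_config()
batch_v1 = client.BatchV1Api()
apps_v1 = client.AppsV1Api()

namespace = "my-training-cluster"               # one TF cluster == one namespace

def all_workers_done(ns):
    jobs = batch_v1.list_namespaced_job(ns, label_selector="job=worker").items
    return bool(jobs) and all((j.status.succeeded or 0) >= 1 for j in jobs)

w = watch.Watch()
for event in w.stream(batch_v1.list_namespaced_job, namespace=namespace,
                      label_selector="job=worker"):
    if all_workers_done(namespace):
        # delete every PS Deployment in this namespace to free its resources
        for d in apps_v1.list_namespaced_deployment(
                namespace, label_selector="job=ps").items:
            apps_v1.delete_namespaced_deployment(
                d.metadata.name, namespace, body=client.V1DeleteOptions())
        w.stop()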

  • Kubernetes Headless Service

A Headless Service is usually used for internal communication between the members of an application cluster deployed on Kubernetes, and we use it the same way here: we create a Headless Service for each TensorFlow Job and Deployment object to act as the communication endpoint between workers and ps.

kind: Service
apiVersion: v1
metadata:
  name: {{ name }}-{{ task_type }}-{{ i }}
  namespace: {{ name }}
spec:
  clusterIP: None
  selector:
    name: {{ name }}
    job: {{ task_type }}
    task: "{{ i }}"
  ports:
  - port: {{ port }}
    targetPort: 2222

The advantage of a Headless Service is that in KubeDNS the Service name resolves directly to the Pod IP; there is no Service VIP layer, so it does not rely on kube-proxy to create iptables rules. Skipping kube-proxy's iptables layer improves performance.

[Figure: Headless Service resolves directly to the Pod IP, with no kube-proxy/iptables layer]

In the TensorFlow scenario this matters a great deal, because every TensorFlow task creates a Service, and tens of thousands of Services is normal. With normal Services there would be hundreds of thousands of iptables rules, adding or deleting a single rule would take hours or even days, and the cluster would have collapsed long before that. For performance test data on the kube-proxy iptables mode, see the Huawei PaaS team's related presentations.

  • KubeDNS Autoscaler

As mentioned earlier, each TensorFlow task creates a Service, each with a corresponding resolution rule in KubeDNS. When there are too many Services, we found that some workers had a high probability of failing to resolve domain names, sometimes needing more than ten attempts before a resolution succeeded. This affects session establishment between the tasks of a TensorFlow cluster and may cause the whole cluster to fail to start.

To solve this problem, we introduced the Kubernetes incubator project kubernetes-incubator/cluster-proportional-autoscaler to scale KubeDNS dynamically with the cluster. For the details of this issue, interested readers can check my blog post https://my.oschina.net/jxcdwangtao/blog/1581879.

TensorFlow on Kubernetes in practice

Based on the above solution, we have developed a TaaS platform that implements the basic functions, including algorithm management, training cluster creation and management, model management, model serving (TensorFlow Serving), one-click creation of TensorBoard services, task resource monitoring, cluster resource monitoring, scheduled training management, online viewing of task logs, batch packaging and download of logs, and so on. For this part, please refer to the article previously shared on DockOne: http://dockone.io/article/3036.

This is just the beginning; I am currently working on the following features:

  • Support preemptive scheduling of tasks based on training priority: when users create a TensorFlow training project on TaaS, they can specify the project's priority as Production, Iteration, or Research (PTR); the default is Iteration. The priority from high to low is Production --> Iteration --> PTR. When cluster resources are insufficient, preemptive scheduling is performed according to task priority.
  • Provide a Yarn-style resource allocation view, so that users can clearly see the resource usage of all their training projects.
  • Hybrid deployment of training and serving (prediction) to improve data center resource utilization.
  • ...

Experience and pitfalls

Throughout the whole process we ran into many pitfalls, in both TensorFlow and Kubernetes, but the biggest source of problems was the CNI network plugin we use, contiv netplugin. Basically every major issue was caused by this network plugin. Kubernetes caused the fewest problems and was more stable than I expected.

  • The contiv netplugin is stable in the DevOps environment, but in large-scale, high-concurrency AI scenarios problems emerge one after another: it generates large numbers of garbage IPs and OpenFlow flow tables and can turn Nodes NotReady outright. I won't say much more here, because as far as I know very few companies use this plugin now; contact me privately if you want to know more.
  • In our solution, one TensorFlow training cluster corresponds to one Kubernetes Namespace. At the beginning of the project we did not clean up garbage Namespaces in time; later, when there were tens of thousands of Namespaces in the cluster, the performance of the related Kubernetes APIs became very poor, and the TaaS user experience suffered badly.
  • TensorFlow's gRPC performance is poor. With training clusters of thousands of workers, errors such as grpc_chttp2_stream request on server; last grpc_chttp2_stream id=xxx, new grpc_chttp2_stream id=xxx can occur probabilistically. This is a performance problem in TensorFlow's underlying gRPC: the Handlergrpc in lower gRPC versions is still single-threaded. You can only try to upgrade gRPC by upgrading TensorFlow, or upgrade the gRPC version separately when compiling TensorFlow. If you upgrade the TensorFlow version, your algorithms may need API adaptation. For now we reduce the number of workers by increasing the computing load of each worker, which reduces the gRPC pressure.
  • There are also issues with TensorFlow's own OOM mechanism, among other things.

The above content comes from the online sharing I did in the DockOne WeChat group.
