The practice of running the deep learning framework TensorFlow on Kubernetes

What is deep learning?


The term deep learning comes up all the time. What is it? The technology behind it originated from neural networks, which were first inspired by the working principles of the human brain. The human brain is a very complex structure that can be divided into many regions, such as the auditory center and the visual center. When I worked in a research institute, video and computer vision had their own lab, language had its own, and speech had its own; different functions were separated in the division of disciplines, which reflects how we humans have understood the brain. Scientists later discovered, however, that the human brain may be a general computational model.

Scientists ran the following experiment: they severed the pathway between the ear and the auditory center in mice and connected the visual input to the auditory center instead. After a few months, the mice could process visual signals with their auditory center. This suggests that the working principle of the brain is uniform: neurons work the same way everywhere, they just need to be trained. Based on this assumption, neuroscientists made an attempt to give blind people hope of seeing the world again. They attached electrodes to the tongue and used a camera to transmit different pixels to the tongue, so that blind people could, in effect, see the world through the tongue. This deeper understanding of how human nerves work lets us see the promise of deep learning as a general learning model.

The figure above shows the general structure of a neural network. On the left is a human neuron, and on the right is a neural network neuron. The neurons of neural networks were first inspired by the structure of human neurons and try to model how they work; we will not discuss the specific techniques in depth here. The lower part of the figure compares the human neural network with the Artificial Neural Network. In a computer neural network, we need to explicitly define the input layer and the output layer. Using the inputs and outputs of an artificial neural network sensibly helps us solve practical problems.

The core working principle of a neural network is to transform a given input signal into an output signal, so that the output signal solves the problem at hand. For example, take a text classification problem where we need to label articles as sports or art. We can feed the words of an article to the neural network as input, with each output node representing a category. For the category the article should belong to, we want the corresponding output node to produce the value 1 and all the others to produce 0. By setting up the structure of the neural network appropriately and training its parameters, the trained model can tell us which category an article belongs to.
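A minimal sketch of this idea in plain Python (not TensorFlow): the "network" below is a single fully connected layer with made-up weights, mapping word-count features to two output nodes (sports, art), normalized with softmax so the outputs can be read as class probabilities. All numbers, feature names, and weights are illustrative assumptions, not a trained model.

```python
import math

def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, weights):
    """One fully connected layer: each output node is a weighted sum of inputs."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(scores)

# Toy input: counts of the words ["goal", "match", "painting"] in an article.
article = [3.0, 2.0, 0.0]

# Hypothetical trained weights: row 0 scores "sports", row 1 scores "art".
weights = [
    [1.0, 1.0, -1.0],   # sports output node
    [-1.0, -1.0, 1.0],  # art output node
]

probs = classify(article, weights)
label = ["sports", "art"][probs.index(max(probs))]
print(label, probs)
```

In a real network the weights would be learned from labeled examples rather than written by hand, but the input-to-output mapping works the same way.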

Application of Deep Learning in Image Recognition

Deep learning's original application was image recognition, and the most classic benchmark is the ImageNet dataset.

ImageNet is a very large dataset containing 15 million images. The image below shows a sample from the dataset.

Before deep learning algorithms were applied, traditional machine learning methods had limited ability to process images. Before 2012, the best machine learning algorithms achieved an error rate of about 25%, and it was hard to make further progress. In 2012, deep learning was applied to the ImageNet dataset for the first time and cut the error rate directly to 16%. In the following years, as deep learning algorithms improved, the error rate dropped all the way to 3.5% in 2016. On the ImageNet dataset, the human classification error rate is about 5.1%. The machine error rate is now lower than the human one, which is the technological breakthrough brought by deep learning.

What is TensorFlow

TensorFlow is a deep learning framework that Google open sourced in November 2015. We mentioned AlphaGo at the beginning; its development team, DeepMind, has announced that all of its subsequent systems will be implemented on top of TensorFlow. TensorFlow is a very powerful open source deep learning tool that supports mobile phones, CPUs, GPUs, and distributed clusters, and it is widely used in both academia and industry. In industry, systems built on TensorFlow such as Google Translate and Google RankBrain are already in production. In academia, many of my classmates at CMU and Peking University say TensorFlow is their tool of choice for implementing deep learning algorithms.

The ppt above gives a simple TensorFlow program that implements vector addition. TensorFlow provides both Python and C++ APIs, but the Python API is more complete, so most TensorFlow programs are written in Python. The first line of the program loads TensorFlow via import. All data in TensorFlow is stored in the form of tensors; to compute the concrete value of the data in a tensor, we need a session.

The second line of the code above shows how to create a session. A session manages the computing resources needed to run a TensorFlow program. A special kind of tensor in TensorFlow is the variable (tf.Variable); before using a variable, we need to explicitly run its initialization. In the last line of the code, we can see that to get the value of the resulting output tensor, we need to explicitly ask the session to evaluate it.

Implementing a neural network with TensorFlow is very simple: the MNIST handwritten digit recognition problem can be solved in about 10 lines with TFLearn or TensorFlow-Slim. The ppt above shows TensorFlow's support for different neural network structures; TensorFlow can express all of the mainstream neural network architectures in very little code.

Although TensorFlow makes it quick to implement neural networks, it is difficult to train large-scale deep neural networks with the single-machine version of TensorFlow.

This figure shows the Inception-v3 model that Google proposed in 2015. The model can reach 95% accuracy on the ImageNet dataset. However, it has 25 million parameters, and classifying a single image requires 5 billion addition or multiplication operations. Even just running inference with a neural network of this scale already requires a very large amount of computation, and training a deep neural network requires far more. Optimizing a neural network is complex: there is no direct mathematical solution, so it requires repeated iteration. On a single machine, training Inception-v3 to 78% accuracy takes more than five months; reaching 95% accuracy would take years. This is completely unacceptable for a real production environment.

TensorFlow on Kubernetes

As described above, large neural networks cannot be trained in a single-machine environment. Inside Google, Google Brain and TensorFlow run on Borg, Google's internal cluster management system. When I worked on Google's e-commerce products, the product classification algorithm we used ran on more than a thousand servers. Outside Google, we can run TensorFlow on Kubernetes. Before introducing how to run TensorFlow on Kubernetes, let us first look at how to parallelize the training of deep learning models.

There are two common ways to train deep learning models in a distributed fashion: synchronous updates and asynchronous updates. As shown in the ppt above, in synchronous mode all servers read the same parameter values, compute the parameter gradients, and finally apply one unified update. In asynchronous mode, each server reads the parameters, computes gradients, and updates the parameters on its own, without synchronizing with the other servers. The biggest problem with synchronous updates is that all servers must complete every step together, so fast servers wait for slow ones and resource utilization is relatively low. Asynchronous mode, in turn, may update parameters with stale gradients, which can hurt training quality. Each mode has its pros and cons; it is hard to say in general which is better, and the choice depends on the specific problem.
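The arithmetic behind the two modes can be sketched with a small single-process simulation in plain Python (no real networking; the gradient values and learning rate are made up for illustration). In synchronous mode the parameter is updated once with the average of all workers' gradients; in asynchronous mode each worker's gradient is applied as soon as it arrives, even if it was computed against a now-stale parameter value.

```python
def sync_update(param, worker_grads, lr=0.1):
    """Synchronous mode: wait for all workers, average their gradients,
    then apply one unified update."""
    avg = sum(worker_grads) / len(worker_grads)
    return param - lr * avg

def async_update(param, worker_grads, lr=0.1):
    """Asynchronous mode: each worker applies its own gradient immediately,
    without waiting for the others."""
    for g in worker_grads:
        param = param - lr * g  # may use a gradient computed on a stale param
    return param

grads = [0.9, 1.1, 1.0]  # gradients reported by three workers
print(sync_update(10.0, grads))   # one averaged step
print(async_update(10.0, grads))  # three independent steps
```

The simulation also shows the trade-off in miniature: the synchronous version takes one careful step per round but must wait for the slowest worker, while the asynchronous version takes more steps per unit of time at the cost of possibly stale gradients.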

Whichever update mode is used, training a deep learning model with distributed TensorFlow requires two types of servers: parameter servers and computing servers. Parameter servers manage and store the values of the neural network parameters; computing servers are responsible for computing the parameter gradients.

There are also two modes for launching a distributed deep learning training job in TensorFlow. The first is in-graph replication: in this mode, all of the neural network's parameters are stored in a single TensorFlow computation graph, and only the computation is distributed across the different computing servers. The other is between-graph replication: in this mode, every computing server also creates the parameters, but the parameters are assigned to the parameter servers in a consistent way. Because in-graph replication is weaker at handling massive amounts of data, between-graph replication is the more commonly used mode.

One last question: we just said that TensorFlow supports running as a distributed cluster, so why do we still need Kubernetes? A simple analogy between TensorFlow and the Hadoop ecosystem explains this clearly. As we all know, Hadoop can be roughly divided into YARN, HDFS, and the MapReduce computation framework; TensorFlow corresponds only to the MapReduce computation framework part of the Hadoop system.

TensorFlow has no scheduling system like YARN and no storage system like HDFS. This is the part that Kubernetes fills in: it provides task scheduling, monitoring, restart on failure, and other capabilities. Without these, it would be very hard to manually start TensorFlow servers on every machine and keep monitoring the status of the jobs. In addition, distributed TensorFlow currently does not support lifecycle management: finished training processes do not shut themselves down automatically, which also requires extra handling.


http://www.infoq.com/cn/articles/practise-of-tensorflow-on-kubernetes
