TensorFlow distributed deployment and development, in plain language

TensorFlow's distributed training and deployment is covered by an official English document, but the writing is fairly terse and the examples are simple, so people new to distributed deep learning may find it hard to follow. The material I found online never felt easy enough to understand either, so I figured it would be better to write an accessible introduction and share it with everyone.





1. Single-machine multi-GPU training

First, a brief introduction to single-machine multi-GPU training, and then on to distributed multi-machine multi-GPU training.

For single-machine multi-GPU training, TensorFlow officially provides a cifar example with fairly detailed code and documentation. Here I only give a general overview of the multi-GPU workflow, as a stepping stone to multi-machine multi-GPU training.

Single-machine multi-GPU training process:

a) Assuming you have 3 GPUs on your machine;

b) With a single machine and a single GPU, the data is trained batch by batch. With a single machine and multiple GPUs, 3 batches are processed at a time (assuming 3 GPUs are used for training), and each GPU computes on one batch of data.

c) The variables, i.e. the parameters, are stored on the CPU.

d) At the start, the CPU distributes the data to the 3 GPUs; the computation runs on the GPUs, and each GPU obtains the gradients to apply for its batch.

e) The gradients from the 3 GPUs are then collected on the CPU, the average gradient is computed, and the parameters are updated.

f) The process then repeats in a loop.

With this process, the overall speed depends on the slowest GPU. If the 3 GPUs are similar in speed, the throughput is roughly 3 times that of a single GPU, minus the overhead of moving data between the CPU and the GPUs. The actual speedup therefore depends on the CPU-GPU transfer speed and on the size of the data being processed.
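To make steps c) through e) concrete, here is a minimal sketch of the pattern, loosely following the official cifar multi-GPU example. The helpers build_model() (returns a loss for one batch) and next_batch() are hypothetical names for illustration; in the real cifar example the variables are additionally pinned to the CPU by a helper function.

import tensorflow as tf

NUM_GPUS = 3

def average_gradients(tower_grads):
    # Collect each variable's gradients from all GPU towers and average them.
    averaged = []
    for grads_and_var in zip(*tower_grads):
        grads = [g for g, _ in grads_and_var]
        var = grads_and_var[0][1]
        averaged.append((tf.reduce_mean(tf.stack(grads), axis=0), var))
    return averaged

opt = tf.train.GradientDescentOptimizer(0.01)
tower_grads = []
for i in range(NUM_GPUS):
    with tf.device("/gpu:%d" % i):                      # each GPU computes one batch
        with tf.variable_scope("model", reuse=(i > 0)): # the towers share the variables
            loss = build_model(next_batch())            # hypothetical helpers
            tower_grads.append(opt.compute_gradients(loss))

with tf.device("/cpu:0"):                               # average and apply updates on the CPU
    train_op = opt.apply_gradients(average_gradients(tower_grads))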


Having written this far, I feel the explanation above is still not all that easy to follow, so let me explain it with a more down-to-earth analogy:

The teacher gave Xiaoming and Xiaohua 10,000 sheets of multiplication problems and asked them to add up all the multiplication results. There are 128 multiplication problems on each sheet. Here one sheet of paper is a batch, and the batch_size is 128. Xiaoming is faster at addition and Xiaohua is faster at multiplication, so Xiaohua is responsible for the multiplication and Xiaoming is responsible for adding up Xiaohua's multiplication results. In this setup, Xiaoming is the CPU and Xiaohua is the GPU.

At this rate, Xiaoming and Xiaohua would probably need a week to finish the task the teacher assigned. So Xiaoming recruited two more helpers who were also fast at multiplication, Xiaohong and Xiaoliang. Each time, Xiaoming gave Xiaohua, Xiaohong, and Xiaoliang one sheet each and asked them to do the multiplication. When the three of them finished, they told Xiaoming the results, Xiaoming added them up, and then handed each of them another sheet of multiplication problems, cycling like this until everything was computed.

Here Xiaoming uses the synchronous mode: each round, he waits until all three of them have finished before doing the addition. So the speed depends on the slowest of the three, and on how fast the sheets are handed out.

 

2. Distributed multi-machine multi-GPU training

As the models we design become more and more complex, the model parameters become more and more numerous. How large can they get? There can be tens of billions of parameters, and the training data is measured in terabytes. As everyone knows, each round computes the gradients and then updates the parameters. When the parameter count rises to tens of billions or more, the performance of parameter updates becomes a problem. Even a single machine with 16 GPUs can process at most 16 batches per step; with TB-scale data, who knows when training would finish. Hence distributed deep learning training methods, or frameworks.

 

parameter server

Before introducing TensorFlow's distributed training, let's first talk about the concept of the parameter server.

As mentioned earlier, when your model gets bigger and bigger and has more and more parameters, and a single machine no longer has enough performance to update the model parameters, it is natural to think of splitting the parameters across different machines for storage and update.

Because of the problems above, the storage and update of parameters is split out into a separate service, and that is the concept of the parameter server. The parameter server can be a cluster of multiple machines; it resembles a distributed storage architecture and involves data synchronization, consistency, and so on. The parameters are generally stored as key-value pairs, so you can think of it as a distributed in-memory key-value database with some parameter-update operations added on top. You can google it for more details; I won't go into them here. In short, when one machine's performance is not enough, tens of billions of parameters are spread across different machines to be stored and updated, which solves the performance problem of parameter storage and update.
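To make that key-value picture more concrete, here is a purely illustrative toy sketch of one parameter-server shard. It is not how TensorFlow or any real system implements it, and it ignores networking, replication, and consistency entirely:

# Toy sketch only: one shard of a parameter server is conceptually a
# key-value store plus a parameter-update rule.
class ToyParameterServerShard(object):
    def __init__(self, learning_rate=0.01):
        self.params = {}          # key -> current parameter value
        self.lr = learning_rate

    def pull(self, key):
        # A worker fetches the latest value of a parameter.
        return self.params[key]

    def push_gradient(self, key, grad):
        # A worker sends back a gradient; the shard applies the update in place.
        self.params[key] = self.params[key] - self.lr * grad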

Borrowing the arithmetic example above: Xiaoming felt he could not keep up with the addition, so he brought in 10 more Xiaomings to help with the computation.

 

tensorflow distributed

That said, TensorFlow's distributed version reportedly does not use a parameter server but a data flow graph. I have not studied this in depth yet, but it should have a lot in common with the parameter server, so the parameter-server structure is what is introduced here.

TensorFlow's distributed training has two architectural modes: in-graph and between-graph. They are introduced separately below.

in-graph mode:

The in-graph mode is somewhat similar to the single-machine multi-GPU model. There is still one Xiaoming doing the addition, but the ones doing the multiplication are no longer limited to Xiaohua, Xiaohong, and Xiaoliang from his own classroom; they can also be Xiaozhang, Xiaoli, and others from other classrooms.

 

The in-graph mode extends the computation from single-machine multi-GPU to multi-machine multi-GPU, but the data is still distributed from a single node. The advantage is that configuration is simple: the other multi-machine multi-GPU compute nodes only need to run a join operation, expose a network interface, and wait there to receive tasks. Using these exposed network interfaces feels just like using a local GPU: as long as you specify tf.device("/job:worker/task:n") for an operation, you can pin it onto a particular compute node, exactly as you would pin it to a specific GPU, so the usage is very similar to multi-GPU. The downside is that the distribution of training data still happens on one node, and shipping the training data to the different machines seriously limits the concurrent training speed. For training on large data, this mode is not recommended.
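A minimal sketch of what in-graph mode can look like in code (host:port values are made up for illustration): the single client builds one graph and pins operations onto remote workers with tf.device, while each worker machine only starts a server and joins.

import tensorflow as tf

# Illustrative cluster definition (addresses are made up).
cluster = tf.train.ClusterSpec({"worker": ["192.168.100.42:2222",
                                           "192.168.100.43:2222"]})

# On worker n, the only code needed is roughly:
#   server = tf.train.Server(cluster, job_name="worker", task_index=n)
#   server.join()

# The single client builds one graph and pins ops onto remote workers,
# just as it would pin them onto local GPUs:
losses = []
for i in range(2):
    with tf.device("/job:worker/task:%d" % i):
        x = tf.random_normal([128, 10])            # in practice: the batch sent to this task
        losses.append(tf.reduce_sum(tf.square(x)))

total_loss = tf.add_n(losses)

with tf.Session("grpc://192.168.100.42:2222") as sess:
    print(sess.run(total_loss))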

 

between-graph mode

In between-graph mode, the trained parameters are stored on the parameter servers and the data does not need to be distributed: it is stored, already sharded, on each compute node. Each compute node computes on its own shard and, when finished, tells the parameter servers which parameters to update, and the parameter servers update them. The advantage of this mode is that there is no need to distribute the training data, which saves a lot of time especially when the data volume is at the TB level, so between-graph mode is the recommended choice for deep learning on big data.

 

Synchronous and asynchronous updates

Both in-graph mode and between-graph mode support synchronous and asynchronous updates.

With synchronous updates, every gradient update waits until all the distributed shards have finished their computation and returned their results; the gradients are then accumulated and averaged before the parameters are updated. The advantage is that the loss decreases steadily; the obvious disadvantage is that the processing speed depends on the slowest shard.

With asynchronous updates, every compute node computes on its own and updates the parameters with its own results. The advantage is that computation is fast and the compute resources are fully utilized; the disadvantage is that the loss does not decrease steadily and jitters a lot.

When the data volume is small and the compute power of the nodes is fairly balanced, the synchronous mode is recommended; when the data volume is large and the machines' compute performance is uneven, the asynchronous mode is recommended.
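For reference, in TF 1.x synchronous updates in between-graph mode are typically done by wrapping the optimizer with tf.train.SyncReplicasOptimizer. This is only a rough sketch (the replica counts and the names loss and global_step are illustrative); the example below sticks to plain asynchronous updates.

# Sketch only: synchronous gradient aggregation across workers (TF 1.x API).
# `loss` and `global_step` stand for your model's loss tensor and step counter.
opt = tf.train.GradientDescentOptimizer(0.01)
opt = tf.train.SyncReplicasOptimizer(opt,
                                     replicas_to_aggregate=2,   # wait for 2 workers per step
                                     total_num_replicas=2)
train_op = opt.minimize(loss, global_step=global_step)

# The chief worker also needs the optimizer's extra plumbing, for example
# sync_hook = opt.make_session_run_hook(is_chief=(FLAGS.task_index == 0))
# when training with tf.train.MonitoredTrainingSession.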

Example

TensorFlow has an official document on distributed TensorFlow, but the example there does not come with complete code. Here is the simplest example that can actually run, for your reference, explained step by step so it is easier to understand.

Code location:

https://github.com/thewintersun/distributeTensorflowExample

Functionality:

What the code does: for the expression Y = 2 * X + 10, where X is the input and Y is the output, given many samples of X and Y, estimate that the weight is 2 and the bias is 10.

All the nodes, whether ps nodes or worker nodes, run the same code; only the command-line arguments differ.

Example commands:

Run on the ps node:


CUDA_VISIBLE_DEVICES='' python distribute.py --ps_hosts=192.168.100.42:2222 --worker_hosts=192.168.100.42:2224,192.168.100.253:2225 --job_name=ps --task_index=0

 

Run on the worker nodes:


CUDA_VISIBLE_DEVICES=0 python distribute.py --ps_hosts=192.168.100.42:2222 --worker_hosts=192.168.100.42:2224,192.168.100.253:2225 --job_name=worker --task_index=0
CUDA_VISIBLE_DEVICES=0 python distribute.py --ps_hosts=192.168.100.42:2222 --worker_hosts=192.168.100.42:2224,192.168.100.253:2225 --job_name=worker --task_index=1

 

First come the parameter definitions, which should be familiar to everyone:


# Define parameters
FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_float('learning_rate', 0.00003, 'Initial learning rate.')
tf.app.flags.DEFINE_integer('steps_to_validate', 1000,
                     'Steps to validate and print loss')
# For distributed
tf.app.flags.DEFINE_string("ps_hosts", "",
                           "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "",
                           "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("job_name", "", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")
# Hyperparameters
learning_rate = FLAGS.learning_rate
steps_to_validate = FLAGS.steps_to_validate

Notes on the code:

1. The learning rate is deliberately set very small, so that training runs slowly and the process is easier to observe;

2. The IPs and ports of the ps nodes and the worker nodes are passed in through the command-line arguments. ps is short for parameter server; ps nodes mainly store and update the parameters, while worker nodes mainly do the computation. The nodes here are logical nodes, not necessarily physical machines;

3. Multiple nodes are separated by commas;

 


  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
  server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":
    with tf.device(tf.train.replica_device_setter(
                    worker_device="/job:worker/task:%d" % FLAGS.task_index,
                    cluster=cluster)):

1. The ClusterSpec definition must include the IPs and ports of all the ps and worker nodes that will run this job. Since every node executes this same piece of code, they all know which members the cluster contains and whether each member is a ps node or a worker node.

2. Starting from the tf.train.Server definition, the nodes behave differently. The command-line arguments decide which task this process runs.

If the job name is ps, the program joins here, serving parameter updates and waiting for the worker nodes to submit parameter-update data to it.

If it is a worker task, it goes on to execute the computation that follows.

3. replica_device_setter is worth a closer look; see the TensorFlow documentation and the Python source for the details. Parameters defined under this with statement are automatically placed on the parameter servers; if there are multiple parameter servers, they are assigned in a round-robin fashion.
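As a small illustration of that round-robin placement (cluster addresses are made up, assuming two ps tasks), variables defined under the setter alternate between the ps tasks, while the ops themselves stay on the worker:

# Illustration only: with two ps tasks, variables are placed round-robin.
setter = tf.train.replica_device_setter(
    worker_device="/job:worker/task:0",
    cluster=tf.train.ClusterSpec({"ps": ["ps0:2222", "ps1:2222"],
                                  "worker": ["worker0:2222"]}))
with tf.device(setter):
    v1 = tf.get_variable("v1", [1])   # ends up on /job:ps/task:0
    v2 = tf.get_variable("v2", [1])   # ends up on /job:ps/task:1
    v3 = tf.get_variable("v3", [1])   # back to /job:ps/task:0
    total = v1 + v2 + v3              # the add ops run on /job:worker/task:0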

 


      global_step = tf.Variable(0, name='global_step', trainable=False)

      input = tf.placeholder("float")
      label = tf.placeholder("float")

      weight = tf.get_variable("weight", [1], tf.float32, initializer=tf.random_normal_initializer())
      biase  = tf.get_variable("biase", [1], tf.float32, initializer=tf.random_normal_initializer())
      pred = tf.mul(input, weight) + biase

      loss_value = loss(label, pred)

      train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss_value, global_step=global_step)
      init_op = tf.initialize_all_variables()
      
      saver = tf.train.Saver()
      tf.scalar_summary('cost', loss_value)
      summary_op = tf.merge_all_summaries()

 

This block of code is the same as ordinary single-machine single-GPU code: it just defines the computation logic, nothing special.

 


    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             logdir="./checkpoint/",
                             init_op=init_op,
                             summary_op=None,
                             saver=saver,
                             global_step=global_step,
                             save_model_secs=60)
    with sv.managed_session(server.target) as sess:
      step = 0
      while  step < 1000000:
        train_x = np.random.randn(1)
        train_y = 2 * train_x + np.random.randn(1) * 0.33  + 10
        _, loss_v, step = sess.run([train_op, loss_value,global_step], feed_dict={input:train_x, label:train_y})
        if step % steps_to_validate == 0:
          w,b = sess.run([weight,biase])
          print("step: %d, weight: %f, biase: %f, loss: %f" %(step, w, b, loss_v))

 

1. Supervisor. As the name suggests, it acts as a supervisor. Because training is now distributed across many machines, things like parameter initialization, saving the model, and writing summaries are handled by the Supervisor for you, so you don't have to do them by hand. In a distributed environment this also involves sharing various parameters, and that process is not easy to write by hand, so TensorFlow wraps it all up for you. The is_chief argument is important here: among all the compute nodes there is still one chief node, and that chief node is responsible for initializing the parameters, saving the model, and saving the summaries. logdir is the path for saving and loading the model. On startup the Supervisor seems to look in the logdir directory for a checkpoint file: if one exists it is loaded automatically, otherwise the parameters are initialized with init_op. There does not appear to be an argument that disables this automatic loading;

 

2. The chief worker node is responsible for work such as initializing the model parameters. While it does so, the other worker nodes wait; once the chief has finished initializing, everyone happily starts crunching the data together.

3. The value of global_step here is shared by all the compute nodes. It is automatically incremented by 1 when the optimizer's minimize runs, so it tells you how many steps all the compute nodes have computed in total.

 

Original article: http://www.tensorflow123.cn/baihuatfdistribute.html
