TensorFlow 1.13 distributed training: tutorial and principles

Preface

When the amount of data is large, training can be accelerated through distributed training. Compared with single-machine single-card training, single-machine multi-card training only requires wrapping the computation in with tf.device('/gpu:0') (or '/gpu:1', and so on) to pin it to a particular GPU, while distributed training is more troublesome because it involves dividing the work among multiple machines. This article briefly introduces how to run distributed TensorFlow training across multiple machines (whether each machine has a single card or multiple cards does not matter).
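For comparison, here is a minimal sketch of single-machine multi-card placement with tf.device() (this snippet is an addition for illustration, not part of the original article; it assumes at least two GPUs, and allow_soft_placement=True lets it fall back to whatever devices are available):

import tensorflow as tf

# Put one matrix multiplication on each GPU
with tf.device('/gpu:0'):
    a = tf.matmul(tf.random_normal([64, 128]), tf.random_normal([128, 32]))
with tf.device('/gpu:1'):
    b = tf.matmul(tf.random_normal([64, 128]), tf.random_normal([128, 32]))
total = a + b    # the results are combined on a default device

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(total).shape)    # (64, 32)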

There are two main differences between distributed training and single-machine training : 1. How to start training; 2. How to divide labor during training. They will be introduced in the following two sections respectively.

1. Confirm each other

Stand-alone training can be started simply by running a script that tells the machine "start training", but for distributed training multiple machines need to communicate with each other, so they first have to "get to know each other". You give each machine a "list" so that it can find the other machines; this list is the so-called ClusterSpec. Letting each machine find the others means that the script has to be run once on every machine.

Let's look at an example. Suppose we use two ports of the local machine, "localhost:2222" and "localhost:2223", to simulate two machines in a cluster. The job of each machine is simply to print a sentence. First we write two scripts. The first one looks like this:

import tensorflow as tf

# What each machine does (simplified: no training, just a print)
c = tf.constant("Hello from server1")

# The cluster "list"
cluster = tf.train.ClusterSpec({
    "local": ["localhost:2222", "localhost:2223"]})
# Declare the server and tell this machine which entry in the list it is
server = tf.train.Server(cluster, job_name="local", task_index=0)
# Open a session against this server
sess = tf.Session(server.target, config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))
server.join()

The second script looks like this:

import tensorflow as tf

# What each machine does (simplified: no training, just a print)
c = tf.constant("Hello from server2")

# The cluster "list"
cluster = tf.train.ClusterSpec({
    "local": ["localhost:2222", "localhost:2223"]})
# Declare the server and tell this machine which entry in the list it is
server = tf.train.Server(cluster, job_name="local", task_index=1)
# Open a session against this server
sess = tf.Session(server.target, config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))
server.join()

Let's briefly explain what is in these scripts. The two scripts are almost identical; they both hold the same "list", namely

# Declare the cluster "list"
cluster = tf.train.ClusterSpec({
    "local": ["localhost:2222", "localhost:2223"]})

The only difference is the task_index specified when creating the Server, which tells each machine which name in the list is its own. Under the hood, each machine starts a service, and the machines then communicate through these services and the shared list.

# The server in the first script
server = tf.train.Server(cluster, job_name="local", task_index=0)
# The server in the second script
server = tf.train.Server(cluster, job_name="local", task_index=1)

Now we have two scripts (with real multiple machines, the two scripts would live on different machines; in this example two ports of one machine simulate the cluster, so both scripts can sit in the same directory). Let's get this "cluster" started! First open a command-line window and run the first script from that directory:

# Run the first machine (console window)
$ python3 server1.py

# Output
# ... N lines omitted ...
2020-04-24 14:58:58.841179: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:local/replica:0/task:1
2020-04-24 14:59:08.844255: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:local/replica:0/task:1
2020-04-24 14:59:18.847998: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:local/replica:0/task:1
2020-04-24 14:59:28.852471: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:local/replica:0/task:1
2020-04-24 14:59:38.852649: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:local/replica:0/task:1
2020-04-24 14:59:48.856933: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:local/replica:0/task:1

Ignoring the WARNING lines, the repeated CreateSession still waiting for response from worker messages in the console mean that this server is waiting for the other machine in the cluster; after all, we have not started the second machine yet. Next, we open another command-line window (representing the other machine) and start the second script in the same directory:

# Run the second machine (console window)
$ python3 server2.py

# Output
# ... N lines omitted ...
Const: (Const): /job:local/replica:0/task:0/device:CPU:0
2020-04-24 15:02:27.653508: I tensorflow/core/common_runtime/placer.cc:54] Const: (Const): /job:local/replica:0/task:0/device:CPU:0
b'Hello from server2'

We can see that once the second script starts, all (both) machines in the cluster are present and begin working. The second machine directly prints b'Hello from server2'. At the same time, the first machine also gets to work:

# Output of the first machine after the second machine (console window) joins the cluster
Const: (Const): /job:local/replica:0/task:0/device:CPU:0
2020-04-24 15:02:28.732132: I tensorflow/core/common_runtime/placer.cc:54] Const: (Const): /job:local/replica:0/task:0/device:CPU:0
b'Hello from server1'

To sum up, distributed training proceeds as follows: first, every machine gets a copy of the script; second, every machine is given the same "list", i.e. the ClusterSpec; third, the script is run on every machine, which starts its service; finally, the machines can communicate and work with each other.
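As a bridge to the next section, here is a minimal sketch (the file name merged_hello.py and the command-line handling are additions for illustration, not part of the original scripts) that merges the two scripts into one; the task index is passed as a command-line argument:

import sys
import tensorflow as tf

# Task index chosen at launch time, e.g.
#   python3 merged_hello.py 0    (first machine)
#   python3 merged_hello.py 1    (second machine)
task_index = int(sys.argv[1])

# The same cluster "list" on every machine
cluster = tf.train.ClusterSpec({
    "local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=task_index)

c = tf.constant("Hello from server%d" % (task_index + 1))
sess = tf.Session(server.target, config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))
server.join()

Running python3 merged_hello.py 0 and python3 merged_hello.py 1 in two console windows should reproduce the behaviour shown above.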

2. Close cooperation

The first section showed how the machines in a cluster get to know each other and start working together. This section describes how they divide the work and cooperate to complete training. In the previous example, the ClusterSpec only declared the list of the two machines; there was no real division of roles, and both machines simply printed a sentence.

tf.train.ClusterSpec({
    "local": ["localhost:2222", "localhost:2223"]})

A real training job is more complicated: we need to assign different tasks to different machines, which are usually divided into ps machines and worker machines. The ps (parameter server) machines store the network parameters, aggregate the gradients, and update the parameters, while the worker machines handle the forward pass and the backward (gradient) computation. This division of labor is expressed when creating the ClusterSpec.

# Machines are usually split into ps and worker, but the split can be adapted to the situation;
# what matters is that the code spells out what each kind of machine should do.
tf.train.ClusterSpec({
    "ps": ["localhost:2222"],                       # machines that store and update the parameters
    "worker": ["localhost:2223", "localhost:2224"]  # machines that run the forward pass and compute gradients
})

In this example, three ports of the local machine again simulate three machines. In the dictionary passed to ClusterSpec, each key is the name of a job (a role in the division of labor) and each value is the list of machines assigned to that job.
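As a quick illustration (this snippet is an addition, not one of the tutorial's scripts), a ClusterSpec can be queried to see how job names and task indices map to addresses; the same names later appear in device strings such as "/job:ps/task:0" and "/job:worker/task:1":

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"]})

print(cluster.jobs)                       # the job names, e.g. ['ps', 'worker']
print(cluster.num_tasks("worker"))        # 2
print(cluster.task_address("worker", 1))  # 'localhost:2224'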

Now that we know how to define a cluster, let's look at how to assign tasks to each machine. In the first section we wrote two nearly identical scripts, which is laborious and hard to maintain on a large cluster. It is better to write a single script and, when running it on each machine, pass in that machine's role (ps or worker) and address (ip:port) as parameters. Distributed training comes in two flavors, asynchronous and synchronous, which are introduced in turn below:

2.1 Asynchronous distributed training

We use a simple DNN that classifies the MNIST data set as the example. The script looks like this:

# Asynchronous distributed training
# coding=utf-8
import time
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data    # data loading is not the focus here, so just import it

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string("job_name", "worker", "ps or worker")
tf.app.flags.DEFINE_integer("task_id", 0, "Task ID of the worker/ps running the train")
tf.app.flags.DEFINE_string("ps_hosts", "localhost:2222", "ps hosts")
tf.app.flags.DEFINE_string("worker_hosts", "localhost:2223,localhost:2224", "worker hosts, separated by commas")

# Global constants
MODEL_DIR = "./distribute_model_ckpt/"
DATA_DIR = "./data/mnist/"
BATCH_SIZE = 32


# main function
def main(_):
    # ==========  STEP 1: read the data  ========== #
    mnist = input_data.read_data_sets(DATA_DIR, one_hot=True, source_url='http://yann.lecun.com/exdb/mnist/')

    # ==========  STEP 2: declare the cluster  ========== #
    # Build the ClusterSpec and declare the server
    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")
    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})    # the cluster "list"
    server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_id)    # declare the server

    # ==========  STEP 3: what the ps machine does  ========== #
    # A ps machine does not run the training loop; it only hosts the variables.
    # server.join() blocks on this statement forever.
    if FLAGS.job_name == "ps":
        with tf.device("/cpu:0"):
            server.join()

    # ==========  STEP 4: what a worker machine does  ========== #
    # Everything below defines the work done by a worker machine
    is_chief = (FLAGS.task_id == 0)    # the worker with task_id=0 acts as chief

    # replica_device_setter assigns a device to every op: it automatically places
    # all variables on the parameter servers and the computation on the current worker.
    device_setter = tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_id,
        cluster=cluster)

    # The computation this worker machine has to run
    with tf.device(device_setter):
        # Input data
        x = tf.placeholder(name="x-input", shape=[None, 28*28], dtype=tf.float32)    # each sample is a 28*28 image
        y_ = tf.placeholder(name="y-input", shape=[None, 10], dtype=tf.float32)      # MNIST has 10 classes
        # First layer (hidden layer)
        with tf.variable_scope("layer1"):
            weights = tf.get_variable(name="weights", shape=[28*28, 128], initializer=tf.glorot_normal_initializer())
            biases = tf.get_variable(name="biases", shape=[128], initializer=tf.glorot_normal_initializer())
            layer1 = tf.nn.relu(tf.matmul(x, weights) + biases, name="layer1")
        # Second layer (output layer)
        with tf.variable_scope("layer2"):
            weights = tf.get_variable(name="weights", shape=[128, 10], initializer=tf.glorot_normal_initializer())
            biases = tf.get_variable(name="biases", shape=[10], initializer=tf.glorot_normal_initializer())
            y = tf.add(tf.matmul(layer1, weights), biases, name="y")
        pred = tf.argmax(y, axis=1, name="pred")
        global_step = tf.contrib.framework.get_or_create_global_step()    # global_step must be created explicitly, otherwise an error is raised
        # Loss and optimization
        cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=y, labels=tf.argmax(y_, axis=1))
        loss = tf.reduce_mean(cross_entropy)
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss, global_step=global_step)

        hooks = [tf.train.StopAtStepHook(last_step=10000)]
        config = tf.ConfigProto(
            allow_soft_placement=True,    # if the requested device is unavailable, fall back to an available GPU or CPU
            log_device_placement=False,   # if True, log which device each operation is placed on
        )

        # ==========  STEP 5: open the session  ========== #
        # For distributed training, use tf.train.MonitoredTrainingSession() instead of tf.Session()
        # See: https://www.cnblogs.com/estragon/p/10034511.html
        with tf.train.MonitoredTrainingSession(
                master=server.target,
                is_chief=is_chief,
                checkpoint_dir=MODEL_DIR,
                hooks=hooks,
                save_checkpoint_secs=10,
                config=config) as sess:
            print("session started!")
            start_time = time.time()
            step = 0

            while not sess.should_stop():
                xs, ys = mnist.train.next_batch(BATCH_SIZE)    # batch_size=32
                _, loss_value, global_step_value = sess.run([train_op, loss, global_step], feed_dict={x: xs, y_: ys})
                if step > 0 and step % 100 == 0:
                    duration = time.time() - start_time
                    sec_per_batch = duration / global_step_value
                    print("After %d training steps(%d global steps), loss on training batch is %g (%.3f sec/batch)" % (step, global_step_value, loss_value, sec_per_batch))
                step += 1


if __name__ == "__main__":
    tf.app.run()

Although the code is fairly long, the overall structure is clear. It falls into five steps: 1. read the data; 2. declare the cluster; 3. the ps machine's work; 4. the worker machine's work; 5. open the session. Step 4, the worker machine's part, also contains the definition of the network structure, so it is the most involved.

Next, place the script on the three machines of the cluster and run it on each one. First, run the script on the ps machine:

# ps machine script
$ python3 distribute_train.py --job_name=ps --task_id=0 --ps_hosts=localhost:2222 --worker_hosts=localhost:2223,localhost:2224

# Output
2020-04-24 17:16:44.530325: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-24 17:16:44.546565: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x102ccad20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-24 17:16:44.546582: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-24 17:16:44.548075: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2020-04-24 17:16:44.548088: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2223, 1 -> localhost:2224}
2020-04-24 17:16:44.548525: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:365] Started server with target: grpc://localhost:2222

Then run the script on the first worker machine. After starting, it waits for the other worker to join:

# First worker machine
$ python3 distribute_train.py --job_name=worker --task_id=0 --ps_hosts=localhost:2222 --worker_hosts=localhost:2223,localhost:2224

# ... N lines of output omitted ...
2020-04-24 17:25:41.174507: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
2020-04-24 17:25:51.176111: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
2020-04-24 17:26:01.180872: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
2020-04-24 17:26:11.184377: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1

Then run the script for the second worker machine:

# Second worker machine
$ python3 distribute_train.py --job_name=worker --task_id=1 --ps_hosts=localhost:2222 --worker_hosts=localhost:2223,localhost:2224

# Output
session started!
After 100 training steps(100 global steps), loss on training batch is 1.59204 (0.004 sec/batch)
After 200 training steps(200 global steps), loss on training batch is 1.10218 (0.003 sec/batch)
After 300 training steps(300 global steps), loss on training batch is 0.71179 (0.003 sec/batch)
After 400 training steps(400 global steps), loss on training batch is 0.679103 (0.002 sec/batch)
After 500 training steps(500 global steps), loss on training batch is 0.50411 (0.002 sec/batch)
# ... N lines of output omitted ...

2.2 Synchronous distributed training

The same DNN is used for the MNIST classification task:

# Synchronous distributed training
# coding=utf-8
import time
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data    # data loading is not the focus here, so just import it

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string("job_name", "worker", "ps or worker")
tf.app.flags.DEFINE_integer("task_id", 0, "Task ID of the worker/ps running the train")
tf.app.flags.DEFINE_string("ps_hosts", "localhost:2222", "ps hosts")
tf.app.flags.DEFINE_string("worker_hosts", "localhost:2223,localhost:2224", "worker hosts, separated by commas")

# Global constants
MODEL_DIR = "./distribute_model_ckpt/"
DATA_DIR = "./data/mnist/"
BATCH_SIZE = 32


# main function
def main(_):
    # ==========  STEP 1: read the data  ========== #
    mnist = input_data.read_data_sets(DATA_DIR, one_hot=True, source_url='http://yann.lecun.com/exdb/mnist/')

    # ==========  STEP 2: declare the cluster  ========== #
    # Build the ClusterSpec and declare the server
    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")
    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})    # the cluster "list"
    server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_id)    # declare the server
    n_workers = len(worker_hosts)    # number of worker machines

    # ==========  STEP 3: what the ps machine does  ========== #
    # A ps machine does not run the training loop; it only hosts the variables.
    # server.join() blocks on this statement forever.
    if FLAGS.job_name == "ps":
        with tf.device("/cpu:0"):
            server.join()

    # ==========  STEP 4: what a worker machine does  ========== #
    # Everything below defines the work done by a worker machine
    is_chief = (FLAGS.task_id == 0)    # the worker with task_id=0 acts as chief

    # replica_device_setter assigns a device to every op: it automatically places
    # all variables on the parameter servers and the computation on the current worker.
    device_setter = tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_id,
        cluster=cluster)

    # The computation this worker machine has to run
    with tf.device(device_setter):
        # Input data
        x = tf.placeholder(name="x-input", shape=[None, 28*28], dtype=tf.float32)    # each sample is a 28*28 image
        y_ = tf.placeholder(name="y-input", shape=[None, 10], dtype=tf.float32)      # MNIST has 10 classes
        # First layer (hidden layer)
        with tf.variable_scope("layer1"):
            weights = tf.get_variable(name="weights", shape=[28*28, 128], initializer=tf.glorot_normal_initializer())
            biases = tf.get_variable(name="biases", shape=[128], initializer=tf.glorot_normal_initializer())
            layer1 = tf.nn.relu(tf.matmul(x, weights) + biases, name="layer1")
        # Second layer (output layer)
        with tf.variable_scope("layer2"):
            weights = tf.get_variable(name="weights", shape=[128, 10], initializer=tf.glorot_normal_initializer())
            biases = tf.get_variable(name="biases", shape=[10], initializer=tf.glorot_normal_initializer())
            y = tf.add(tf.matmul(layer1, weights), biases, name="y")
        pred = tf.argmax(y, axis=1, name="pred")
        global_step = tf.contrib.framework.get_or_create_global_step()    # global_step must be created explicitly, otherwise an error is raised
        # Loss and optimization
        cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=y, labels=tf.argmax(y_, axis=1))
        loss = tf.reduce_mean(cross_entropy)
        # **Synchronous updates are implemented with tf.train.SyncReplicasOptimizer**
        opt = tf.train.SyncReplicasOptimizer(
            tf.train.GradientDescentOptimizer(0.01),
            replicas_to_aggregate=n_workers,
            total_num_replicas=n_workers
        )
        sync_replicas_hook = opt.make_session_run_hook(is_chief)
        train_op = opt.minimize(loss, global_step=global_step)

        hooks = [sync_replicas_hook, tf.train.StopAtStepHook(last_step=10000)]    # add the synchronization hook
        config = tf.ConfigProto(
            allow_soft_placement=True,    # if the requested device is unavailable, fall back to an available GPU or CPU
            log_device_placement=False,   # if True, log which device each operation is placed on
        )

        # ==========  STEP 5: open the session  ========== #
        # For distributed training, use tf.train.MonitoredTrainingSession() instead of tf.Session()
        # See: https://www.cnblogs.com/estragon/p/10034511.html
        with tf.train.MonitoredTrainingSession(
                master=server.target,
                is_chief=is_chief,
                checkpoint_dir=MODEL_DIR,
                hooks=hooks,
                save_checkpoint_secs=10,
                config=config) as sess:
            print("session started!")
            start_time = time.time()
            step = 0

            while not sess.should_stop():
                xs, ys = mnist.train.next_batch(BATCH_SIZE)    # batch_size=32
                _, loss_value, global_step_value = sess.run([train_op, loss, global_step], feed_dict={x: xs, y_: ys})
                if step > 0 and step % 100 == 0:
                    duration = time.time() - start_time
                    sec_per_batch = duration / global_step_value
                    print("After %d training steps(%d global steps), loss on training batch is %g (%.3f sec/batch)" % (step, global_step_value, loss_value, sec_per_batch))
                step += 1


if __name__ == "__main__":
    tf.app.run()

Synchronous distributed training is almost identical to asynchronous distributed training, with only two differences:

  • The optimizer: the base tf.train.GradientDescentOptimizer is wrapped in tf.train.SyncReplicasOptimizer
  • The hooks: sync_replicas_hook = opt.make_session_run_hook(is_chief) is added to the hook list

Everything else is the same as asynchronous distributed training, so it is not repeated here; the sketch below isolates just these two changes.
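A minimal, self-contained sketch of the synchronous-update wiring (the toy scalar loss and the hard-coded n_workers / is_chief values are assumptions for illustration; the snippet only builds the graph and does not launch a cluster):

import tensorflow as tf

n_workers = 2       # would normally be len(worker_hosts)
is_chief = True     # would normally be (FLAGS.task_id == 0)

global_step = tf.train.get_or_create_global_step()
w = tf.get_variable("w", initializer=10.0)
loss = tf.square(w - 3.0)   # toy loss standing in for the DNN loss above

# Difference 1: wrap the base optimizer in SyncReplicasOptimizer
opt = tf.train.SyncReplicasOptimizer(
    tf.train.GradientDescentOptimizer(0.1),
    replicas_to_aggregate=n_workers,    # gradients to collect before each update
    total_num_replicas=n_workers)
train_op = opt.minimize(loss, global_step=global_step)

# Difference 2: the hook that coordinates the replicas, passed to MonitoredTrainingSession
sync_replicas_hook = opt.make_session_run_hook(is_chief)
print(train_op.name, sync_replicas_hook)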

Good examples and explanations:
https://github.com/TracyMcgrady6/Distribute_MNIST/tree/master

Source: blog.csdn.net/weixin_39589455/article/details/132046457