Hands-On Distributed TensorFlow: Multi-Machine, Multi-GPU and CPU Parallel Training

Environment Setup

  1. CUDA 8.0 + TensorFlow 0.11.0 + cuDNN 5.1.5 + Bazel 0.3.2 + GCC 4.9
  2. !!! Your versions should match mine as closely as possible; even a small difference can easily make the build fail.
  3. For CUDA and cuDNN, see my other post http://blog.csdn.net/cq361106306/article/details/52450907 or find a guide on your own.
  4. Bazel: http://bazel.io/docs/install.html
  5. Source archive: https://github.com/tensorflow/tensorflow/archive/r0.11.zip
  6. Then enter the extracted directory:
cd tensorflow
./configure 
Please specify the location of python. [Default is /usr/bin/python]:
Do you wish to build TensorFlow with Google Cloud Platform support? [y/N] N
No Google Cloud Platform support will be enabled for TensorFlow
Do you wish to build TensorFlow with GPU support? [y/N] y
GPU support will be enabled for TensorFlow
Please specify which gcc nvcc should use as the host compiler. [Default is /usr/bin/gcc]:
Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: 8.0  # be sure to enter the version you actually installed
Please specify the location where CUDA 7.5 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify the cuDNN version you want to use. [Leave empty to use system default]: 5.1.5  # be sure to enter the version you actually installed
Please specify the location where cuDNN 5 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.5,5.2"]: 3.0
Setting up Cuda include
Setting up Cuda lib
Setting up Cuda bin
Setting up Cuda nvvm
Setting up CUPTI include
Setting up CUPTI lib64
Configuration finished

Remember, it is best not to git clone the latest master directly: the framework is still young and updated daily, so things break easily. I personally hit a case where one dependency kept failing to install.
This step takes a very long time and downloads many dependencies, so a VPN helps a lot.
  7. If some red Errors appear, open the configure script in this directory (it is a plain text file) and find the line

bazel clean --expunge
and delete it.

Then, on the command line, run

bazel fetch //tensorflow/...

Run it repeatedly until it finishes without errors, then run ./configure again.
  8. Start the build with Bazel:

# To build with GPU support:
$ bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

# The name of the .whl file will depend on your platform.
$ sudo pip install /tmp/tensorflow_pkg/tensorflow-0.11.0rc1-py2-none-any.whl
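
Before running the MNIST test below, a quick smoke test confirms that the freshly installed wheel imports and can see the GPU. This is only a minimal sketch (not part of the official build steps, and the file name is just an example), assuming the 0.11 wheel built above is what got installed:

# smoke_test.py -- minimal post-install sanity check (sketch)
import tensorflow as tf

print(tf.__version__)  # expect something like 0.11.0rc1

# A tiny matmul; log_device_placement prints which device each op runs on,
# so you can see whether /gpu:0 is being used.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 0.0], [0.0, 1.0]])

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(tf.matmul(a, b)))

If the GPU build worked, the placement log should mention gpu:0; otherwise the ops will simply be placed on the CPU.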

Test

cd tensorflow/models/image/mnist

python convolutional.py

Pitfalls encountered during the build

ERROR: /home/y/tensorflow-r0.11/tensorflow/core/kernels/BUILD:1096:1: C++ compilation of rule '//tensorflow/core/kernels:svd_op' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command 
..
gcc: internal compiler error: Killed (program cc1plus)   (this means the machine ran out of memory)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-4.8/README.Bugs> for instructions.
Target //tensorflow/cc:tutorials_example_trainer failed to build

For the out-of-memory problem, see https://github.com/tensorflow/tensorflow/issues/349; common workarounds are adding swap space or limiting Bazel's resource usage as shown below.

If the build fails with some inexplicable ERROR, retry it with the command below:

bazel build -c opt --config=cuda --spawn_strategy=standalone --verbose_failures --local_resources 2048,.5,1.0 //tensorflow/tools/pip_package:build_pip_package

The --spawn_strategy=standalone and --local_resources flags work around some of these errors; --local_resources 2048,.5,1.0 roughly tells Bazel to assume 2048 MB of RAM, half a CPU core, and standard I/O, which keeps the build from exhausting memory. If this build passes, continue with the steps above.
If everything goes well, the final output should look like this (it may appear to hang for a while; it is actually downloading the MNIST data):

y@y:~/tensorflow-r0.11$ cd tensorflow/models/image/mnist
y@y:~/tensorflow-r0.11/tensorflow/models/image/mnist$ python convolutional.py 
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so.5.1.5 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so.8.0 locally
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties: 
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.7715
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.41GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
Initialized!

Distributed Demo

As an aside, TensorFlow's distributed support only wraps the low-level communication layer, so we can write distributed training almost the same way we write single-machine training.
First run ifconfig and note this machine's 192.168.x.x address; we will use several processes on one machine to simulate a cluster.
Everything below lives in a single file (the launch commands later refer to it as distribute.py):

#coding=utf-8
import numpy as np
import tensorflow as tf

# Define parameters
FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_float('learning_rate', 0.00003, 'Initial learning rate.')
tf.app.flags.DEFINE_integer('steps_to_validate', 1000,
                     'Steps to validate and print loss')

# For distributed
tf.app.flags.DEFINE_string("ps_hosts", "",
                           "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "",
                           "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("job_name", "", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")

# Hyperparameters
learning_rate = FLAGS.learning_rate
steps_to_validate = FLAGS.steps_to_validate

def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
  server = tf.train.Server(cluster,job_name=FLAGS.job_name,task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    # Parameter-server processes just host the variables and block here.
    server.join()
  elif FLAGS.job_name == "worker":
    # replica_device_setter places the variables on the ps tasks and the ops on this worker.
    with tf.device(tf.train.replica_device_setter(
                    worker_device="/job:worker/task:%d" % FLAGS.task_index,
                    cluster=cluster)):
      global_step = tf.Variable(0, name='global_step', trainable=False)

      # A toy linear model: pred = weight * input + biase.
      input = tf.placeholder("float")
      label = tf.placeholder("float")

      weight = tf.get_variable("weight", [1], tf.float32, initializer=tf.random_normal_initializer())
      biase  = tf.get_variable("biase", [1], tf.float32, initializer=tf.random_normal_initializer())
      pred = tf.mul(input, weight) + biase

      loss_value = loss(label, pred)

      train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss_value, global_step=global_step)
      init_op = tf.initialize_all_variables()

      saver = tf.train.Saver()
      tf.scalar_summary('cost', loss_value)
      summary_op = tf.merge_all_summaries()

    # The chief worker (task_index 0) initializes variables and saves checkpoints.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                            logdir="./checkpoint/",
                            init_op=init_op,
                            summary_op=None,
                            saver=saver,
                            global_step=global_step,
                            save_model_secs=60)      
    with sv.managed_session(server.target) as sess:
      step = 0
      while  step < 1000000:
        train_x = np.random.randn(1)
        train_y = 2 * train_x + np.random.randn(1) * 0.33  + 10
        _, loss_v, step = sess.run([train_op, loss_value,global_step], feed_dict={input:train_x, label:train_y})
        if step % steps_to_validate == 0:
          w,b = sess.run([weight,biase])
          print("step: %d, weight: %f, biase: %f, loss: %f" %(step, w, b, loss_v))

    sv.stop()

def loss(label, pred):
  return tf.square(label - pred)



if __name__ == "__main__":
  tf.app.run()
# Run on the ps node:
CUDA_VISIBLE_DEVICES='' python distribute.py --ps_hosts=192.168.1.100:2222 --worker_hosts=192.168.1.100:2224,192.168.1.100:2225 --job_name=ps --task_index=0

# Run on the worker nodes:
CUDA_VISIBLE_DEVICES=0 python distribute.py --ps_hosts=192.168.1.100:2222 --worker_hosts=192.168.1.100:2224,192.168.1.100:2225 --job_name=worker --task_index=0

CUDA_VISIBLE_DEVICES='' python distribute.py --ps_hosts=192.168.1.100:2222 --worker_hosts=192.168.1.100:2224,192.168.1.100:2225 --job_name=worker --task_index=1 

Start them in that order. The ps processes are parameter servers (there can be more than one); the workers form the training cluster. To run on real separate machines, replace 192.168.1.100 with each machine's own IP and launch each command on its own host.
CUDA_VISIBLE_DEVICES=0 makes the process use GPU 0.
CUDA_VISIBLE_DEVICES='' hides all GPUs, so the process runs on the CPU.
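
If you want to double-check where your ops actually end up (GPU vs CPU), device-placement logging makes it visible. The snippet below is a minimal standalone sketch, not part of the original distribute.py, using the same 0.11-era API; the file name device_check.py is just an example:

# device_check.py -- standalone sketch for inspecting GPU/CPU placement
import tensorflow as tf

# Pin one op to the GPU and another to the CPU explicitly.
with tf.device("/gpu:0"):
    a = tf.random_normal([1000, 1000])
    b = tf.matmul(a, a)
with tf.device("/cpu:0"):
    c = tf.reduce_sum(b)

# allow_soft_placement lets TensorFlow fall back to the CPU when no GPU is
# visible (for example when launched with CUDA_VISIBLE_DEVICES='');
# log_device_placement prints the device chosen for every op.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(c))

Run it once with CUDA_VISIBLE_DEVICES=0 and once with CUDA_VISIBLE_DEVICES='' to watch the placement change.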


Reprinted from blog.csdn.net/cq361106306/article/details/52929468