TensorFlow distributed programming: working with common GPU configurations

Table of Contents

 

Distributed background

Common commands

Code modification

Common distributed strategies

 


Distributed background

By default TensorFlow uses every GPU and occupies all of their memory.
How to avoid wasting memory and resources: memory growth, the virtual device mechanism.
How to use multiple GPUs (virtual GPUs vs. physical GPUs): manual placement or the distributed mechanisms.

Common commands

 watch -n 0.1 -x nvidia-smi   # monitor GPU usage in real time
 nvidia-smi                   # show the current GPU usage
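
If only the memory numbers are needed, nvidia-smi also has a query mode (standard nvidia-smi flags; adjust the queried fields as needed):

 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1   # per-GPU memory usage as CSV, refreshed every second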

Code modification

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import sklearn
import pandas as pd
import os
import sys
import time
import tensorflow as tf
from tensorflow import keras
print(tf.__version__)
print(sys.version_info)
for module in mpl,np,pd,sklearn,tf,keras:
    print(module.__name__,module.__version__)

tf.debugging.set_log_device_placement(True) # log which device each op is placed on
tf.config.set_soft_device_placement(True) # no need to set devices explicitly; ops are placed on a suitable device automatically
gpus = tf.config.experimental.list_physical_devices('GPU') # list the physical GPUs
tf.config.experimental.set_visible_devices(gpus[0],'GPU') # make only this GPU visible to this notebook
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu,True)  # enable memory growth (allocate on demand)
# configure logical partitions (virtual GPUs)
# tf.config.experimental.set_virtual_device_configuration(gpus[0],[tf.config.experimental.VirtualDeviceConfiguration(memory_limit=3072),
#                                                                  tf.config.experimental.VirtualDeviceConfiguration(memory_limit=3072)])
print(len(gpus))
logical_gpus = tf.config.experimental.list_logical_devices('GPU') # list the logical GPUs
print(len(logical_gpus))

c = []
for gpu in logical_gpus:
    print(gpu.name)
    with tf.device(gpu.name): # place the following ops on this GPU
        a = tf.constant([[1.,2.,2.],[2.,3.,4.]])
        b = tf.constant([[1.,2.],[2.,3.],[4.,5.]])
        c.append(tf.matmul(a,b)) # matrix multiplication
with tf.device('CPU:0'):
    matmul_sum = tf.add_n(c) # sum the tensors in the list
print(matmul_sum)
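
The commented-out virtual-device configuration above can also be written with the stable (non-experimental) API names available in newer TF 2.x releases. A minimal sketch, assuming a single physical GPU with room for two 3 GB slices; like the experimental call, it must run before the GPUs are initialized:

# Sketch: split the first physical GPU into two logical GPUs (stable API, newer TF 2.x)
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=3072),   # 3 GB slice
         tf.config.LogicalDeviceConfiguration(memory_limit=3072)])  # 3 GB slice
print(len(tf.config.list_logical_devices('GPU')))  # expect 2 logical GPUs on that card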
        
        

Common distributed strategies

  • MirroredStrategy: synchronous distributed training; suited to one machine with multiple GPUs; every GPU holds a full copy of the model parameters, and the copies are kept in sync; data parallelism (each batch is split into N shards, one per GPU; the gradients are then aggregated and applied to the parameters on every GPU).
  • CentralStorageStrategy: a variant of MirroredStrategy; the parameters are not replicated on every GPU but stored on a single device (CPU or one GPU); computation runs in parallel on all GPUs (except the parameter-update step).
  • MultiWorkerMirroredStrategy: multiple machines with multiple GPUs; otherwise similar to MirroredStrategy.
  • TPUStrategy: the strategy used on TPUs; otherwise similar to MirroredStrategy.
  • ParameterServerStrategy: asynchronous and distributed; better suited to large-scale clusters; machines play one of two roles, parameter server or worker (the former aggregates gradients and updates the parameters, the latter computes gradients and trains the network).

Synchronous vs. asynchronous trade-offs: with multiple machines and multiple GPUs, asynchronous training avoids the straggler (short-board) effect; on a single machine with multiple GPUs, synchronous training avoids excessive communication; asynchronous training can also improve the model's generalization.
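
For the multi-machine case, MultiWorkerMirroredStrategy reads the cluster layout from the TF_CONFIG environment variable. A minimal sketch with placeholder host names and ports (older releases expose the class as tf.distribute.experimental.MultiWorkerMirroredStrategy):

# Sketch: each machine sets TF_CONFIG before creating the strategy
import json, os
import tensorflow as tf

os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['host1:12345', 'host2:12345']},  # placeholder hosts/ports
    'task': {'type': 'worker', 'index': 0}                  # 0 on the first machine, 1 on the second
})
strategy = tf.distribute.MultiWorkerMirroredStrategy()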

strategy = tf.distribute.MirroredStrategy() # note: scale batch_size by len(gpus)
# for Estimator, configure an extra RunConfig and pass it to model_to_estimator via the config argument
# when writing a custom training loop, wrap the dataset with strategy.experimental_distribute_dataset()
config = tf.estimator.RunConfig(train_distribute=strategy)
# for model.fit, simply build the model inside the scope as below
c = []
with strategy.scope():
    a = tf.constant([[1.,2.,2.],[2.,3.,4.]])
    b = tf.constant([[1.,2.],[2.,3.],[4.,5.]])
    c.append(tf.matmul(a,b)) # matrix multiplication
with tf.device('CPU:0'):
    matmul_sum = tf.add_n(c) # sum the tensors in the list
print(matmul_sum)
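
Putting the pieces together for model.fit, a minimal end-to-end sketch (the model, the random data, and the sizes are made up for illustration; the essentials are scaling the batch size by the number of replicas and building/compiling the model inside the scope):

import numpy as np
import tensorflow as tf
from tensorflow import keras

strategy = tf.distribute.MirroredStrategy()
global_batch = 64 * strategy.num_replicas_in_sync   # scale the per-replica batch size

x = np.random.random((1024, 20)).astype('float32')  # toy features, illustration only
y = np.random.randint(0, 2, size=(1024,))           # toy binary labels
dataset = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(1024).batch(global_batch)

with strategy.scope():                               # build and compile inside the scope
    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=(20,)),
        keras.layers.Dense(1, activation='sigmoid')])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(dataset, epochs=2)                         # model.fit splits each global batch across replicas

strategy.experimental_distribute_dataset() is only needed when writing a custom training loop; with model.fit, the strategy distributes the batches on its own.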
    


 


Origin blog.csdn.net/weixin_40539952/article/details/107996911