[TensorFlow 2.0] Train a model on a single GPU

Training deep learning models is often very time-consuming. Training a model for several hours is commonplace, a few days is not unusual, and some models take tens of days to train.

Training time comes mainly from two parts: data preparation and parameter iteration.

When data preparation is the main bottleneck of training time, we can use more processes to prepare the data.

When parameter iteration is the main bottleneck of training time, the usual approach is to accelerate it with a GPU or Google's TPU.

See "Accelerating Keras Model with GPU-Colab Free GPU Usage Guide" for details

https://zhuanlan.zhihu.com/p/68509398

Whether you use the built-in fit method or a custom training loop, switching from the CPU to a single GPU is very convenient and requires no code changes. When a GPU is available and no device is specified, TensorFlow automatically prefers the GPU for creating tensors and running tensor computations.

However, on a shared server in a company or school laboratory with multiple GPUs and multiple users, one user's task should not occupy all GPU resources and leave nothing for everyone else. By default, TensorFlow acquires permission to use all the memory of all GPUs, even though it actually uses only part of the resources of one GPU. We therefore usually add a few lines of code at the beginning of the program to control which GPU and how much memory each task uses, so that other users can train their models at the same time.

In the Colab notebook: Edit -> Notebook settings -> select GPU under Hardware accelerator.

Note: The following code can only be executed correctly on Colab.

You can try this example, "tf_single GPU", through the following Colab link:

https://colab.research.google.com/drive/1r5dLoeJq5z01sU72BX2M5UiNSkuxsEFe

%tensorflow_version 2.x
import tensorflow as tf
print(tf.__version__)
from tensorflow.keras import * 
 
# Print a timestamped separator line
@tf.function
def printbar():
    ts = tf.timestamp()
    today_ts = ts%(24*60*60)
 
    # +8 converts UTC to Beijing time (UTC+8)
    hour = tf.cast(today_ts//3600+8,tf.int32)%tf.constant(24)
    minute = tf.cast((today_ts%3600)//60,tf.int32)
    second = tf.cast(tf.floor(today_ts%60),tf.int32)
 
    def timeformat(m):
        # Zero-pad single-digit values
        if tf.strings.length(tf.strings.format("{}",m))==1:
            return(tf.strings.format("0{}",m))
        else:
            return(tf.strings.format("{}",m))
 
    timestring = tf.strings.join([timeformat(hour),timeformat(minute),
                timeformat(second)],separator = ":")
    tf.print("=========="*8,end = "")
    tf.print(timestring)

2.2.0-rc2

One, GPU settings

gpus = tf.config.list_physical_devices("GPU")
 
if gpus:
    gpu0 = gpus[0] # If there are multiple GPUs, use only GPU 0
    tf.config.experimental.set_memory_growth(gpu0, True) # Allocate GPU memory on demand
    # Alternatively, cap GPU memory at a fixed amount (for example, 4 GB)
    # tf.config.experimental.set_virtual_device_configuration(gpu0,
    #     [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
    tf.config.set_visible_devices([gpu0], "GPU")
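
A similar restriction can also be expressed through environment variables that are set before TensorFlow initializes its GPUs (in practice, before importing it). The sketch below is only an illustration and is not part of the original notebook: CUDA_VISIBLE_DEVICES is the CUDA variable that limits which GPUs a process can see, and TF_FORCE_GPU_ALLOW_GROWTH is the TensorFlow variable that turns on on-demand memory allocation. The helper name restrict_gpu_env and its parameters are hypothetical.

```python
import os

def restrict_gpu_env(gpu_index=0, allow_growth=True, env=None):
    """Set environment variables that limit a process to one GPU.

    restrict_gpu_env is a hypothetical helper, not a TensorFlow API.
    Call it before TensorFlow is imported so the settings take effect.
    """
    env = os.environ if env is None else env
    # Expose only the chosen GPU to this process
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    if allow_growth:
        # Ask TensorFlow to allocate GPU memory on demand
        env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env

# Demonstrate on a plain dict so the real environment is left untouched
settings = restrict_gpu_env(gpu_index=0, env={})
print(settings)
```

When CUDA_VISIBLE_DEVICES is used this way, the single visible GPU appears inside the process as GPU 0, whatever its physical index is.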
Compare the computation speed of the CPU and GPU:

printbar()
with tf.device("/gpu:0"):
    tf.random.set_seed(0)
    a = tf.random.uniform((10000,100),minval = 0,maxval = 3.0)
    b = tf.random.uniform((100,100000),minval = 0,maxval = 3.0)
    c = a@b
    tf.print(tf.reduce_sum(tf.reduce_sum(c,axis = 0),axis=0))
printbar()

printbar()
with tf.device("/cpu:0"):
    tf.random.set_seed(0)
    a = tf.random.uniform((10000,100),minval = 0,maxval = 3.0)
    b = tf.random.uniform((100,100000),minval = 0,maxval = 3.0)
    c = a@b
    tf.print(tf.reduce_sum(tf.reduce_sum(c,axis = 0),axis=0))
printbar()
================================================================================11:59:21
2.24953778e+11
================================================================================11:59:23
================================================================================11:59:23
2.24953795e+11
================================================================================11:59:29
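
The timestamps printed by printbar give a rough estimate of the speedup. The snippet below is a small sketch that subtracts them; elapsed_seconds is a helper introduced here for illustration. Because printbar only has one-second resolution, the resulting ratio is approximate.

```python
from datetime import datetime

def elapsed_seconds(start, end, fmt="%H:%M:%S"):
    """Seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()

# Timestamps copied from the output above
gpu_seconds = elapsed_seconds("11:59:21", "11:59:23")
cpu_seconds = elapsed_seconds("11:59:23", "11:59:29")

print(gpu_seconds, cpu_seconds)               # 2.0 6.0
print("speedup:", cpu_seconds / gpu_seconds)  # speedup: 3.0
```

On this run the GPU block took about 2 seconds and the CPU block about 6 seconds, so the GPU was roughly three times faster for this matrix multiplication.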

Two, prepare the data

MAX_LEN = 300
BATCH_SIZE = 32
(x_train,y_train),(x_test,y_test) = datasets.reuters.load_data()
x_train = preprocessing.sequence.pad_sequences(x_train,maxlen=MAX_LEN)
x_test = preprocessing.sequence.pad_sequences(x_test,maxlen=MAX_LEN)
 
MAX_WORDS = x_train.max()+1
CAT_NUM = y_train.max()+1
 
ds_train = tf.data.Dataset.from_tensor_slices((x_train,y_train)) \
          .shuffle(buffer_size = 1000).batch(BATCH_SIZE) \
          .prefetch(tf.data.experimental.AUTOTUNE).cache()
 
ds_test = tf.data.Dataset.from_tensor_slices((x_test,y_test)) \
          .shuffle(buffer_size = 1000).batch(BATCH_SIZE) \
          .prefetch(tf.data.experimental.AUTOTUNE).cache()
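
With its default arguments, pad_sequences pads short sequences with zeros at the front and truncates long sequences from the front, so every sample ends up with exactly MAX_LEN tokens. The pure-Python function below is only a sketch of that default behavior for illustration; pad_pre is a made-up name, and the real function returns a NumPy array rather than nested lists.

```python
def pad_pre(sequences, maxlen, value=0):
    """Mimic the pad_sequences defaults: padding='pre', truncating='pre'."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # Drop tokens from the front, keeping the last maxlen tokens
            seq = seq[len(seq) - maxlen:]
        # Left-pad with the fill value up to maxlen
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre([[1, 2], [3, 4, 5, 6, 7]], maxlen=4))
# [[0, 0, 1, 2], [4, 5, 6, 7]]
```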

Three, define the model

tf.keras.backend.clear_session()
 
def create_model():
 
    model = models.Sequential()
 
    model.add(layers.Embedding(MAX_WORDS,7,input_length=MAX_LEN))
    model.add(layers.Conv1D(filters = 64,kernel_size = 5,activation = "relu"))
    model.add(layers.MaxPool1D(2))
    model.add(layers.Conv1D(filters = 32,kernel_size = 3,activation = "relu"))
    model.add(layers.MaxPool1D(2))
    model.add(layers.Flatten())
    model.add(layers.Dense(CAT_NUM,activation = "softmax"))
    return(model)
 
model = create_model()
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 300, 7)            216874    
_________________________________________________________________
conv1d (Conv1D)              (None, 296, 64)           2304      
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 148, 64)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 146, 32)           6176      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 73, 32)            0         
_________________________________________________________________
flatten (Flatten)            (None, 2336)              0         
_________________________________________________________________
dense (Dense)                (None, 46)                107502    
=================================================================
Total params: 332,856
Trainable params: 332,856
Non-trainable params: 0
_________________________________________________________________
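
The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer definitions. The vocabulary size 30,982 is inferred from the embedding row (216,874 / 7) rather than stated in the text, and the helper names are introduced here for illustration.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def conv1d_params(kernel_size, in_channels, filters):
    return kernel_size * in_channels * filters + filters  # weights + biases

def dense_params(in_features, units):
    return in_features * units + units  # weights + biases

MAX_WORDS = 216874 // 7   # 30982, inferred from the summary (assumption)
CAT_NUM = 46              # from the Dense output shape

counts = [
    embedding_params(MAX_WORDS, 7),   # 216874
    conv1d_params(5, 7, 64),          # 2304
    conv1d_params(3, 64, 32),         # 6176
    dense_params(73 * 32, CAT_NUM),   # 107502
]
print(counts, sum(counts))
# [216874, 2304, 6176, 107502] 332856
```

The sequence lengths follow the same kind of arithmetic: 300 - 5 + 1 = 296 after the first convolution, 148 after pooling, 148 - 3 + 1 = 146 after the second convolution, and 73 after pooling, which gives 73 * 32 = 2336 features after Flatten.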

Four, train the model

optimizer = optimizers.Nadam()
loss_func = losses.SparseCategoricalCrossentropy()
 
train_loss = metrics.Mean(name='train_loss')
train_metric = metrics.SparseCategoricalAccuracy(name='train_accuracy')
 
valid_loss = metrics.Mean(name='valid_loss')
valid_metric = metrics.SparseCategoricalAccuracy(name='valid_accuracy')
 
@tf.function
def train_step(model, features, labels):
    with tf.GradientTape() as tape:
        predictions = model(features,training = True)
        loss = loss_func(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
 
    train_loss.update_state(loss)
    train_metric.update_state(labels, predictions)
 
@tf.function
def valid_step(model, features, labels):
    predictions = model(features)
    batch_loss = loss_func(labels, predictions)
    valid_loss.update_state(batch_loss)
    valid_metric.update_state(labels, predictions)
 
 
def train_model(model,ds_train,ds_valid,epochs):
    for epoch in tf.range(1,epochs+1):
 
        for features, labels in ds_train:
            train_step(model,features,labels)
 
        for features, labels in ds_valid:
            valid_step(model,features,labels)
 
        logs = 'Epoch={},Loss:{},Accuracy:{},Valid Loss:{},Valid Accuracy:{}'
 
        if epoch%1 ==0:
            printbar()
            tf.print(tf.strings.format(logs,
            (epoch,train_loss.result(),train_metric.result(),valid_loss.result(),valid_metric.result())))
            tf.print("")
 
        train_loss.reset_states()
        valid_loss.reset_states()
        train_metric.reset_states()
        valid_metric.reset_states()
 
train_model(model,ds_train,ds_test,10)
================================================================================12:01:11
Epoch=1,Loss:2.00887108,Accuracy:0.470273882,Valid Loss:1.6704694,Valid Accuracy:0.566340148

================================================================================12:01:13
Epoch=2,Loss:1.47044504,Accuracy:0.618681788,Valid Loss:1.51738906,Valid Accuracy:0.630454123

================================================================================12:01:14
Epoch=3,Loss:1.1620506,Accuracy:0.700289488,Valid Loss:1.52190566,Valid Accuracy:0.641139805

================================================================================12:01:16
Epoch=4,Loss:0.878907442,Accuracy:0.771654427,Valid Loss:1.67911685,Valid Accuracy:0.644256473

================================================================================12:01:17
Epoch=5,Loss:0.647668123,Accuracy:0.836450696,Valid Loss:1.93839979,Valid Accuracy:0.642475486

================================================================================12:01:19
Epoch=6,Loss:0.487838209,Accuracy:0.880538881,Valid Loss:2.20062685,Valid Accuracy:0.642030299

================================================================================12:01:21
Epoch=7,Loss:0.390418053,Accuracy:0.90670228,Valid Loss:2.32795334,Valid Accuracy:0.646482646

================================================================================12:01:22
Epoch=8,Loss:0.328294098,Accuracy:0.92351371,Valid Loss:2.44113493,Valid Accuracy:0.644701719

================================================================================12:01:24
Epoch=9,Loss:0.286735713,Accuracy:0.931195736,Valid Loss:2.5071857,Valid Accuracy:0.642920732

================================================================================12:01:25
Epoch=10,Loss:0.256434649,Accuracy:0.936428428,Valid Loss:2.60177088,Valid Accuracy:0.640249312
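
The log above can also be read programmatically. The sketch below parses a few of the printed lines and reports the epoch with the lowest validation loss; parse_log_line is a helper written for this illustration. In this run the validation loss is lowest at epoch 2 and rises afterwards while training accuracy keeps improving, which is the usual sign of overfitting.

```python
import re

# A few lines copied from the training log above
LOG = """\
Epoch=1,Loss:2.00887108,Accuracy:0.470273882,Valid Loss:1.6704694,Valid Accuracy:0.566340148
Epoch=2,Loss:1.47044504,Accuracy:0.618681788,Valid Loss:1.51738906,Valid Accuracy:0.630454123
Epoch=3,Loss:1.1620506,Accuracy:0.700289488,Valid Loss:1.52190566,Valid Accuracy:0.641139805
Epoch=10,Loss:0.256434649,Accuracy:0.936428428,Valid Loss:2.60177088,Valid Accuracy:0.640249312
"""

PATTERN = re.compile(
    r"Epoch=(\d+),Loss:([\d.]+),Accuracy:([\d.]+),"
    r"Valid Loss:([\d.]+),Valid Accuracy:([\d.]+)"
)

def parse_log_line(line):
    """Turn one log line into a dict, or return None if it does not match."""
    m = PATTERN.fullmatch(line.strip())
    if m is None:
        return None
    epoch, loss, acc, val_loss, val_acc = m.groups()
    return {"epoch": int(epoch), "loss": float(loss), "accuracy": float(acc),
            "valid_loss": float(val_loss), "valid_accuracy": float(val_acc)}

records = [r for r in map(parse_log_line, LOG.splitlines()) if r is not None]
best = min(records, key=lambda r: r["valid_loss"])
print(best["epoch"], best["valid_loss"])  # 2 1.51738906
```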

References:

Open source e-book address: https://lyhue1991.github.io/eat_tensorflow2_in_30_days/

GitHub project address: https://github.com/lyhue1991/eat_tensorflow2_in_30_days


Origin www.cnblogs.com/xiximayou/p/12690628.html