tf2 multi-GPU training on a single machine

Foreword

Based on Docker, two GPUs on one machine are used to train a custom (subclassed) Keras model.

The mirrored distributed strategy: MirroredStrategy

There are many distributed strategies; only one is introduced here because it is the quickest way to get started, and practice shows that single-machine multi-GPU training with it works well. For other distributed strategies, see: https://blog.csdn.net/u010099177/article/details/106074932

tf.distribute.MirroredStrategy supports synchronous distributed training on a single machine with multiple GPUs. It creates one replica per GPU device. Every variable in the model is mirrored across all replicas; together these variables form a single conceptual variable called a MirroredVariable. The replicas are kept in sync with each other by applying identical updates.

Efficient all-reduce algorithms are used to pass variable updates between devices. All-reduce aggregates tensors across the devices and makes the result available on every device. It is a fused algorithm that is very efficient and greatly reduces the overhead of synchronization. Many all-reduce algorithms and implementations are available, depending on the type of communication available between devices; by default NVIDIA NCCL is used. You can choose one of the other built-in options or write your own.
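If NCCL does not suit your setup, the all-reduce implementation can be chosen explicitly through the cross_device_ops argument. A minimal sketch using TensorFlow's built-in options (NcclAllReduce is the default on multi-GPU machines; HierarchicalCopyAllReduce is an alternative):

import tensorflow as tf

# Explicitly request NCCL all-reduce (the default on multi-GPU machines)
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce())

# Or fall back to hierarchical copy if NCCL is unavailable
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())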

The easiest way to create a MirroredStrategy is:

strategy = tf.distribute.MirroredStrategy()

This creates a MirroredStrategy instance that uses all GPUs visible to TensorFlow, with NCCL for cross-device communication.

If you only want to use some of the GPUs on your computer, you can do this:

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
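Each GPU in the devices list becomes one replica. A quick way to confirm how many replicas the strategy will use (and hence how the global batch is split) is:

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
print("Number of replicas:", strategy.num_replicas_in_sync)  # prints 2 here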

tf.distribute.Strategy is integrated into tf.keras, the high-level API for building and training models. Because the integration happens in the tf.keras backend, programs written with the Keras training framework can perform distributed training seamlessly.

You need to make the following changes in your code:

  • Create a tf.distribute.Strategy instance
  • Move the Keras model creation and compilation into strategy.scope()
  • All types of Keras models are supported: sequential, functional, and subclassed

Here is a very simple Keras model example:

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
  model.compile(loss='mse', optimizer='sgd')

Just put the model creation and compilation parts inside strategy.scope().
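Since the model trained below is a subclassed Keras model, the same pattern applies: construct and compile the subclass inside strategy.scope(). A minimal sketch with a hypothetical subclass (MyModel is only a stand-in, not the custom model used later):

import tensorflow as tf

class MyModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.dense = tf.keras.layers.Dense(1)

    def call(self, inputs):
        return self.dense(inputs)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = MyModel()
    model.compile(loss='mse', optimizer='sgd')

Note that with MirroredStrategy the batch size of the input pipeline is the global batch size: each replica receives batch_size / num_replicas_in_sync samples per step.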

Implementation code

The key code is as follows, using GPU 0 and GPU 1:

import tensorflow as tf
from tensorflow.keras import optimizers

# Early stopping on validation loss
callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=1e-8, patience=0, verbose=2)]

opt = optimizers.SGD(learning_rate=0.001)

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
with strategy.scope():
    # Build the custom (subclassed) model; FCINN is defined elsewhere in this project
    model = FCINN(dense_feature_columns, sparse_feature_columns, len(dense_features), hidden_units=(512, 256, 128), activation='relu', dropout=(0.3, 0.2, 0.2),
                  k_vector=8, w_reg=0.01, v_reg=0.01, mode='inner',
                  filters=[16, 18, 22, 24], kernel_with=[7, 7, 7, 7], dnn_maps=[3, 3, 3, 3], pooling_width=[2, 2, 2, 2]
                  )

    model.compile(
        optimizer=opt,
        loss='binary_crossentropy',
        metrics=['AUC', 'Precision', 'Recall', 'accuracy']
    )

model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=2,
    verbose=2,
    callbacks=callbacks,
)
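
The train_dataset and val_dataset passed to fit are tf.data.Dataset objects; their construction is not shown above. A hypothetical sketch of such a pipeline (the feature/label arrays and the global batch size are assumptions, not the original preprocessing code):

import tensorflow as tf

GLOBAL_BATCH_SIZE = 1024  # global batch, split across the two replicas by MirroredStrategy
train_dataset = (tf.data.Dataset.from_tensor_slices((train_features, train_labels))
                 .shuffle(10_000)
                 .batch(GLOBAL_BATCH_SIZE)
                 .prefetch(tf.data.AUTOTUNE))
val_dataset = (tf.data.Dataset.from_tensor_slices((val_features, val_labels))
               .batch(GLOBAL_BATCH_SIZE)
               .prefetch(tf.data.AUTOTUNE))

The training script is then launched inside Docker with both GPUs exposed to the container: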
docker run -d --gpus '"device=0,1"' \
    --rm -it --name ctr_tf_tmp \
    -v /data/wangguisen/ctr_note/new_thought:/ad_ctr/new_thought \
    -v /data/wangguisen/ctr_note/data:/ad_ctr/data \
    ad_ctr:3.0 \
    sh -c 'python3 -u /ad_ctr/new_thought/moreGPU.py 1>>/ad_ctr/new_thought/log/moreGPU.log 2>>/ad_ctr/new_thought/log/moreGPU.err'
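
Inside the container it is worth confirming that TensorFlow actually sees both GPUs before training starts; a quick sanity check:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
print(gpus)  # expect two entries: GPU:0 and GPU:1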

Running successfully

(Screenshot: GPU utilization and memory usage)
The utilization and memory usage show that both cards are working, i.e. single-machine dual-GPU training is running successfully.

One machine, one GPU:
(screenshot)

One machine, multiple GPUs:
(screenshot)

References:

Docker: specifying particular GPUs:
https://blog.csdn.net/qq_21768483/article/details/115204043

tf2 Dataset usage:
https://blog.csdn.net/u012513618/article/details/109671774

Distributed training with TensorFlow 2.0:
https://blog.csdn.net/u010099177/article/details/106074932

https://www.cnblogs.com/xiximayou/p/12690709.html

Origin: blog.csdn.net/qq_42363032/article/details/122880645