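Multi-GPU training with Keras in TensorFlow, and saving the model

This post covers a few things to watch out for when training on multiple GPUs with Keras in TensorFlow. It has three parts:

Benefits of multi-GPU training
How to train on multiple GPUs with Keras and save the model with a checkpoint
Things to note when loading a model saved with model.save to resume training or predict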

1. Benefits of multi-GPU training

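(1) Parallel computation across GPUs speeds up training. Single-machine multi-GPU training splits each batch evenly across the GPUs and processes the pieces in parallel. For example, with batch_size=8 and 2 GPUs, each GPU receives 4 samples per step, which speeds up training.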
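(2) It relieves out-of-memory errors on a single GPU. The memory needed for training is roughly the network's parameters + batch_size * the memory per sample (TensorFlow keeps the intermediate results of each sample's forward pass for use in backpropagation). When the model has many parameters or batch_size is large, a single GPU's memory may not be enough, and training fails with an out-of-memory error. Using several GPUs increases the total memory available. Note that under this data-parallel scheme every GPU still holds a full copy of the parameters, so the relief comes from the batch_size term: each GPU only stores activations for its own share of the batch.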
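
The arithmetic in the two points above can be sketched in a few lines of plain Python. The function names and the byte counts are hypothetical, chosen only for illustration, and the memory formula is the rough estimate from point (2), not a measurement of what TensorFlow actually allocates.

```python
def split_batch(batch_size, num_gpus):
    """Return the number of samples each GPU receives for one batch."""
    base, remainder = divmod(batch_size, num_gpus)
    # Hypothetical policy: the first `remainder` GPUs take one extra sample
    # when the batch does not divide evenly.
    return [base + (1 if i < remainder else 0) for i in range(num_gpus)]


def estimate_memory_bytes(param_bytes, batch_size, bytes_per_sample):
    """Rough estimate: parameters + batch_size * per-sample activations."""
    return param_bytes + batch_size * bytes_per_sample


# batch_size=8 on 2 GPUs: each GPU processes 4 samples.
per_gpu = split_batch(8, 2)

# Illustrative (made-up) sizes: 200 MB of parameters, 100 MB per sample.
MB = 1024 ** 2
single_gpu = estimate_memory_bytes(200 * MB, 8, 100 * MB)
each_of_two = estimate_memory_bytes(200 * MB, per_gpu[0], 100 * MB)

print(per_gpu)            # [4, 4]
print(single_gpu // MB)   # 1000
print(each_of_two // MB)  # 600
```

Under these made-up numbers each GPU still carries the full 200 MB of parameters, while its activation memory is halved.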

2. How to train on multiple GPUs with Keras and save the model with a checkpoint

First, here is the code for training on a single GPU:

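from tensorflow import keras

def build_model():
    ...
    return model

# Build the model
model = build_model()
# Configure the model
model.compile(...)
# Save the model with a checkpoint callback
callbacks = [keras.callbacks.ModelCheckpoint(...)]
# Train the model, passing the callbacks in
model.fit(..., callbacks=callbacks)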

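We can choose which GPUs to train on:

import os
# Set this before TensorFlow first touches the GPUs
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

This makes only GPUs 2 and 3 visible to the process. If you train with the single-GPU code above, TensorFlow by default claims memory on both GPUs, but the computation runs on only one of them; the other is simply held without being used.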
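
One detail worth keeping in mind: once CUDA_VISIBLE_DEVICES is set, the visible GPUs are renumbered from 0 inside the process, so physical GPUs 2 and 3 appear as GPU 0 and GPU 1. The helper below is a hypothetical illustration of that mapping in plain Python. It only parses a comma-separated list of integer ids and does not query any hardware.

```python
import os


def visible_device_map(value):
    """Map in-process GPU index -> physical GPU id for a comma-separated id list."""
    ids = [part.strip() for part in value.split(",") if part.strip()]
    return {logical: int(physical) for logical, physical in enumerate(ids)}


os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"
mapping = visible_device_map(os.environ["CUDA_VISIBLE_DEVICES"])
print(mapping)  # {0: 2, 1: 3}
```

This is why device strings and GPU counts in the training code refer to the renumbered indices, not to the physical ids.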

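To actually train on multiple GPUs, the code has to change (at least 2 GPUs are required). Note that multi_gpu_model, used below, was deprecated in TensorFlow 2.x and removed in TensorFlow 2.4, where tf.distribute.MirroredStrategy replaces it, so this code needs an older TensorFlow version.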

import os
import tensorflow as tf
from tensorflow import keras
# Multi-GPU helper
from tensorflow.keras.utils import multi_gpu_model

# Select the GPUs
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

def build_model():
    ...
    return model

# Build the model
## To avoid running out of memory while building the model on a single GPU, build it on the CPU first
with tf.device("/cpu:0"):
    model = build_model()

# Replicate the model onto the 2 GPUs
parallel_model = multi_gpu_model(model, gpus=2)
# Configure the model
parallel_model.compile(...)

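At this point, the keras.callbacks.ModelCheckpoint code from the single-GPU version raises an error. The cause is that the checkpoint by default saves through parallel_model.save(); the correct approach is to save through the original model.save().
Concretely, we implement our own checkpoint class: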

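import numpy as np

class CustomModelCheckpoint(keras.callbacks.Callback):
    def __init__(self, model, path):
        super().__init__()
        # Keras assigns the model being trained (parallel_model) to self.model,
        # so keep the original model under a different attribute name.
        self.model_to_save = model
        self.path = path
        self.best_loss = np.inf

    def on_epoch_end(self, epoch, logs=None):
        val_loss = (logs or {}).get('val_loss')
        if val_loss is None:
            # No validation data was passed to fit(); nothing to compare
            return
        if val_loss < self.best_loss:
            print("\nValidation loss decreased from {} to {}, saving model".format(self.best_loss, val_loss))
            self.model_to_save.save(self.path, overwrite=True)
            self.best_loss = val_loss

# Checkpoint
callbacks = [CustomModelCheckpoint(model=model, path=self.savemodel_path)]

# Train the model
parallel_model.fit(..., callbacks=callbacks)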

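With the code above, training runs in parallel on the 2 GPUs: each GPU processes its share of the batch, and the per-GPU outputs are merged on the CPU into one full batch, from which the final loss is computed.
The model is saved from the CPU, so it can later be loaded directly to resume training on a single GPU or to predict.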
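
The save-on-improvement rule in CustomModelCheckpoint can be tried out without Keras or a GPU. In this sketch, FakeModel and BestLossSaver are hypothetical stand-ins written for illustration only: FakeModel merely records calls to save(), and BestLossSaver mirrors the comparison logic of on_epoch_end. The path and the validation losses are made up.

```python
import math


class FakeModel:
    """Stand-in for the template model; records save() calls instead of writing files."""

    def __init__(self):
        self.saved_paths = []

    def save(self, path, overwrite=True):
        self.saved_paths.append(path)


class BestLossSaver:
    """Saves the wrapped model only when the validation loss improves."""

    def __init__(self, model, path):
        self.model_to_save = model
        self.path = path
        self.best_loss = math.inf

    def on_epoch_end(self, epoch, logs=None):
        val_loss = (logs or {}).get("val_loss")
        if val_loss is None:  # no validation loss reported for this epoch
            return False
        if val_loss < self.best_loss:
            self.model_to_save.save(self.path, overwrite=True)
            self.best_loss = val_loss
            return True
        return False


template = FakeModel()
saver = BestLossSaver(template, "best_model.h5")  # hypothetical path
history = [0.9, 0.7, 0.8, 0.6]                    # made-up validation losses
saved = [saver.on_epoch_end(i, {"val_loss": v}) for i, v in enumerate(history)]

print(saved)                      # [True, True, False, True]
print(len(template.saved_paths))  # 3
```

The third epoch is skipped because 0.8 is worse than the best loss seen so far (0.7), so the file on disk always holds the best model up to that point.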

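3. Things to note when loading a model saved with model.save to resume training or predict

model.save() stores the network structure, the weights, and the training configuration (loss and optimizer), so when we load the model to resume training or predict, there is no need to rebuild or recompile it.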

# Load the model
model = keras.models.load_model(self.savemodel_path)

# If the model uses a custom loss or metric, pass it through custom_objects.
# For example, with a custom dice_loss:
def dice_loss(self, y_true, y_pred):
    ...
    return loss
model = keras.models.load_model(self.savemodel_path,
                                custom_objects={'dice_loss': self.dice_loss})

# To resume training there is no need to rebuild the model or call model.compile(...); just train
model.fit(...)

# To predict
model.predict(...)

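In practice I found that resuming training this way only works on a single GPU. When I tried to move the loaded model onto multiple GPUs, I got an error. My approach is probably wrong, and corrections are welcome:

model = keras.models.load_model(self.savemodel_path)

parallel_model = multi_gpu_model(model, gpus=2)

parallel_model.fit(...)

The code above raises an error, and adding parallel_model.compile(…) does not help.
I think it may work to save only the weights with model.save_weights(), rebuild the model, load the weights into it, then move it to multiple GPUs and compile it. I have not tried this yet.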
