Fixing the "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR" error when training a Keras model

Machine environment
Ubuntu 18.04
RTX 2080 Super
tensorflow-gpu 1.14
keras 2.2.3

An exception is thrown when training the model with Keras. From the log, it looks like a GPU memory overflow. Watching GPU memory with watch -n 1 nvidia-smi confirms this: memory usage suddenly spikes once training starts, and then the process dies.
[Screenshot: nvidia-smi showing GPU memory spiking during training before the process crashes]
Changing the batch size, as the analysis above suggested, still produces the error, so I searched for more information. By default, many deep learning frameworks claim the entire GPU memory when they start, even if they do not need that many resources; once the memory has been claimed, other programs cannot use it, so running in this state easily leads to out-of-memory problems. The fix is simply to change the GPU memory allocation strategy used for model training.

Keras is a wrapper on top of TensorFlow, so it is enough to change TensorFlow's GPU memory allocation strategy by adding the following configuration to the training code.

import tensorflow as tf
config = tf.compat.v1.ConfigProto(gpu_options=tf.compat.v1.GPUOptions(allow_growth=True))
sess = tf.compat.v1.Session(config=config)
# tf.compat.v1.ConfigProto() is the TensorFlow 2.0+ style of writing this. It sets
# runtime options such as how GPU memory is allocated and whether to print logs;
# its arguments take the form option_name=True/False (default False).
# gpu_options=tf.compat.v1.GPUOptions(allow_growth=True) limits GPU resource usage:
# allow_growth=True allocates GPU memory dynamically, requesting only as much as is
# needed at the moment instead of a fixed, unchanging amount.
# sess = tf.compat.v1.Session(config=config) makes these settings take effect.
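
Note that, depending on the Keras version, constructing the session alone may not be enough for the setting to take effect: with the standalone Keras 2.2.3 used here, the configured session typically also has to be registered with the Keras backend. A minimal sketch of that variant, assuming the standard keras.backend.set_session helper:

import tensorflow as tf
import keras.backend as K

# Enable on-demand GPU memory allocation instead of reserving the whole card up front.
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)

# Register the configured session with the Keras backend so that models built with
# keras actually run in it (otherwise Keras may create its own default session).
K.set_session(sess)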

After changing the memory allocation strategy, model training uses the GPU normally.
[Screenshot: nvidia-smi showing normal GPU memory usage during training]
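
For completeness, allow_growth is not the only allocation strategy: a hard cap on the fraction of GPU memory TensorFlow may claim also works, and TensorFlow 2.x has a native API for on-demand growth. A brief sketch of both, with the 0.7 fraction purely as an illustrative value:

import tensorflow as tf

# Alternative 1: hard-cap how much of the GPU memory TensorFlow may claim
# (0.7 here means at most 70% of the card).
gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.7)
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))

# Alternative 2 (TensorFlow 2.x only): the native way to enable on-demand growth.
# for gpu in tf.config.experimental.list_physical_devices('GPU'):
#     tf.config.experimental.set_memory_growth(gpu, True)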

Origin blog.csdn.net/threestooegs/article/details/127856206