TensorFlow-GPU despair: one error after another!!!

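To speed up my model work, after much deliberation I finally decided to switch to the GPU version of TensorFlow.
But!!!!
These damned errors nearly drove me mad!!!
More than once I thought about giving up and going quietly back to the CPU version, but sheer stubbornness told me to hold on a little longer!!
And in the end I got it working!!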

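This post sums up every landmine I stepped on, from installation through to running the MNIST handwritten digit recognition program I wrote earlier while learning!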

Installing CUDA and cuDNN

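When I first saw that I had to install these two things as well, my heart said no. But there is no way around it: this step is part of setting up the environment.
There are not many landmines here. I did notice, though, that a lot of online tutorials skip the step of installing VS2015.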

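A reminder to everyone!! Install VS2015 first. The 2015 version is the safer choice; the 2017 version may have incompatibilities. Without it, the install fails roughly 90% of the time by my estimate (although, inexplicably, some people do succeed). All it really needs is a VC++ build environment, but that environment must be there!!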
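GPU connection problem
I do not know whether I was just unlucky, but almost nobody else seems to have hit this problem.
Right after the install, I tried things out with my own MNIST convolutional network, and the first big error appeared.
The error itself is very short: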
InternalError: Failed to create session.
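That one error cost me a whole day. I searched everywhere online: some people said the GPU might be full, others said the CUDA driver version was wrong. I tried their fixes for ages with no luck. Then it occurred to me to check my CUDA version, and only then did I notice that, after installing CUDA, the NVIDIA Control Panel entry had vanished from the desktop right-click menu. I hurried to look for it in the Windows Control Panel, only to find that it would not open at all!
The message was: "You are not currently using a display attached to an NVIDIA GPU"!
?????
Damn! Finally a glimmer of hope. I found this tutorial online: https://jingyan.baidu.com/article/ac6a9a5e2819872b653eac20.html
That finally fixed it, and with tears in my eyes I hit run again!
But!!!!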
ResourceExhaustedError (see above for traceback): OOM when allocating tensor
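What on earth!
Yet another new error. I went back to searching, and the explanations online were just as vague; the usual claim is that the GPU is getting full. I checked Task Manager, but GPU usage was clearly only around 1-2%. Worn out and with no other ideas, I went to upgrade the driver again, only to find it was already the latest version. By now I really was in despair, staring at my code with no idea what to do.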
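A side note on that 1-2% figure: it is most likely a utilization percentage, which says little about how much memory a single allocation needs. As a hedged sketch, assuming the `nvidia-smi` tool that ships with the NVIDIA driver is on the PATH, the snippet below reads used and total GPU memory instead. The function names `parse_gpu_memory` and `query_gpu_memory` are made up for this illustration.

```python
import shutil
import subprocess


def parse_gpu_memory(text):
    """Parse lines such as '1234, 8192' into (used_mib, total_mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue
        try:
            rows.append((int(parts[0]), int(parts[1])))
        except ValueError:
            # Skip rows that are not plain integers, e.g. "[N/A]".
            continue
    return rows


def query_gpu_memory():
    """Return one (used_mib, total_mib) tuple per GPU, or None if nvidia-smi is unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    try:
        out = subprocess.check_output(
            [exe,
             "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            universal_newlines=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return None
    return parse_gpu_memory(out)


if __name__ == "__main__":
    print(query_gpu_memory())
```

Running it before and during a training run gives a rough idea of how much memory is actually free. The script that raised the error is listed next.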

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

tf.logging.set_verbosity(tf.logging.ERROR)
# Load the MNIST dataset
mnist = input_data.read_data_sets("E:\\mnist_dataset",one_hot = True)

# Set up batched reading
batch_size = 100
n_batch = mnist.train.num_examples // batch_size

# Weight initialization
def Weights_Variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)
# Bias initialization
def Biases_Variable(shape):
    initial = tf.constant(0.1,shape = shape)
    return tf.Variable(initial)

# Convolution helper
def conv2d(x,W):
    return tf.nn.conv2d(x,W,strides = [1,1,1,1],padding = 'SAME')
# Pooling helper
def max_pool_2x2(x):
    return tf.nn.max_pool(x,ksize = [1,2,2,1],strides = [1,2,2,1],padding = 'SAME')

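# Placeholders for inputs, labels and the dropout keep probability
x = tf.placeholder(tf.float32,[None,784])
y = tf.placeholder(tf.float32,[None,10])
keep_prob = tf.placeholder(tf.float32)

# Reshape the input into a 4-D tensor of 28*28 single-channel images
x_image = tf.reshape(x,[-1,28,28,1])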

# Build the convolutional network
# First convolutional layer
W_conv1 = Weights_Variable([5,5,1,32])
b_conv1 = Biases_Variable([32])
# Apply the activation function
h_conv1 = tf.nn.relu(conv2d(x_image,W_conv1) + b_conv1)
# Pooling layer
h_pool1 = max_pool_2x2(h_conv1)

# Second convolutional layer
W_conv2 = Weights_Variable([5,5,32,64])
b_conv2 = Biases_Variable([64])
# Apply the activation function
h_conv2 = tf.nn.relu(conv2d(h_pool1,W_conv2) + b_conv2)
# Pooling layer
h_pool2 = max_pool_2x2(h_conv2)

# Build the fully connected layers
# Flatten the 7*7*64 pooled output into one vector per image
h_pool2_flat = tf.reshape(h_pool2,[-1,7*7*64])
# First fully connected layer
W_fc1 = Weights_Variable([7*7*64,1024])
b_fc1 = Biases_Variable([1024])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat,W_fc1)+b_fc1)
# Apply dropout (keep_prob is the probability of keeping a unit) to reduce overfitting
h_fc1_drop = tf.nn.dropout(h_fc1,keep_prob)

# Second fully connected layer
W_fc2 = Weights_Variable([1024,10])
b_fc2 = Biases_Variable([10])
prediction = tf.nn.softmax(tf.matmul(h_fc1_drop,W_fc2)+b_fc2)

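# Build the training model
# Cross-entropy loss; the logits use h_fc1_drop so that dropout actually takes effect during training
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y,logits=tf.matmul(h_fc1_drop,W_fc2)+b_fc2))
# Train with the Adam optimizer
train_step = tf.train.AdamOptimizer(3e-4).minimize(loss)
# Accuracy: fraction of samples whose predicted class matches the label
correct_prediction = tf.equal(tf.argmax(y,1),tf.argmax(prediction,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction,tf.float32))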

# Start training
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(51):
        for batch in range(n_batch):
            batch_xs,batch_ys = mnist.train.next_batch(batch_size)
            sess.run(train_step,feed_dict = {x: batch_xs,y: batch_ys,keep_prob: 0.7})

        # Evaluate on the entire test set in a single run (this is the line that raises the OOM error)
        acc = sess.run(accuracy,feed_dict = {x: mnist.test.images,y: mnist.test.labels,keep_prob: 1.0})
        print("Iter: "+str(epoch)+", Testing Accuracy: "+str(acc))

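The error pointed to this line: acc = sess.run(accuracy,feed_dict = {x: mnist.test.images,y: mnist.test.labels,keep_prob: 1.0})
I wanted to cry but did not even know where to start. Then I came across this expert's blog: https://blog.csdn.net/xjbada/article/details/65633355
It explained that the cause is feeding too much data at once during testing. Training feeds the data in batches, so the test step can be changed to feed the data in batches as well!!!
It all clicked!!
Many thanks!!!!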
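To see why feeding the whole test set at once can exhaust GPU memory, here is a rough back-of-the-envelope sketch. It assumes float32 activations and counts only the named layer outputs of the network above (h_conv1 through h_fc1). TensorFlow's real peak usage also includes temporaries and workspace buffers, so treat the numbers as an order of magnitude only. The helper names (`same_out`, `activation_shapes`, `float32_bytes`) are made up for this illustration.

```python
import math


def same_out(size, stride):
    """Output size along one spatial dimension for 'SAME' padding."""
    return math.ceil(size / stride)


def activation_shapes(batch):
    """Shapes of the main layer outputs of the network in this post."""
    shapes = []
    s = 28
    s = same_out(s, 1)
    shapes.append(("h_conv1", (batch, s, s, 32)))   # 5x5 conv, stride 1
    s = same_out(s, 2)
    shapes.append(("h_pool1", (batch, s, s, 32)))   # 2x2 max pool, stride 2
    s = same_out(s, 1)
    shapes.append(("h_conv2", (batch, s, s, 64)))   # 5x5 conv, stride 1
    s = same_out(s, 2)
    shapes.append(("h_pool2", (batch, s, s, 64)))   # 2x2 max pool, stride 2
    shapes.append(("h_fc1", (batch, 1024)))         # fully connected layer
    return shapes


def float32_bytes(shape):
    """Bytes needed to store a float32 tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * 4


if __name__ == "__main__":
    # 100 is the training batch size; 10000 is the size of the MNIST test set.
    for batch in (100, 10000):
        total = sum(float32_bytes(shape) for _, shape in activation_shapes(batch))
        print("batch", batch, "->", round(total / 1e6, 1), "MB of layer outputs")
```

With 10000 images, the first convolution output alone takes about 1 GB (10000 x 28 x 28 x 32 x 4 bytes), against about 10 MB for a batch of 100. The shape trace also shows where the 7*7*64 in the flatten step comes from: 28 shrinks to 14 and then to 7 after the two pooling layers.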
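Following this expert's advice, I changed the run section to:

# Start training
# Batch settings for the test set
test_batch_size = 100
test_n_batch = mnist.test.num_examples // test_batch_size

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(51):
        for batch in range(n_batch):
            batch_xs,batch_ys = mnist.train.next_batch(batch_size)
            sess.run(train_step,feed_dict = {x: batch_xs,y: batch_ys,keep_prob: 0.7})

        # Evaluate the test set batch by batch so it never has to fit on the GPU all at once
        good = 0
        total = 0
        for test_batch in range(test_n_batch):
            testSet_x,testSet_y = mnist.test.next_batch(test_batch_size)
            batch_acc = sess.run(accuracy,feed_dict={ x: testSet_x, y: testSet_y, keep_prob: 1.0})
            # accuracy is a per-batch mean, so multiply by the batch size to count correct samples
            good += batch_acc * testSet_x.shape[0]
            total += testSet_x.shape[0]

        acc = good/total
        print("Iter: "+str(epoch)+", Testing Accuracy: "+str(acc))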

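Sure enough, no more problems!!
All this really does is replace computing the mean over the whole test set in one go with accumulating the correct predictions batch by batch and then dividing by the total number of samples. Sure enough!!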
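One detail is easy to get wrong here: a per-batch accuracy is already a mean, so it has to be weighted by the batch size before summing. The sketch below shows the idea in plain Python, with no TensorFlow involved. The function name `batched_accuracy` and the toy label lists are made up for this illustration, and the list comparison stands in for the sess.run(accuracy, ...) call.

```python
def batched_accuracy(predicted, actual, batch_size):
    """Overall accuracy computed batch by batch, weighting each batch by its size."""
    correct = 0.0
    total = 0
    for start in range(0, len(actual), batch_size):
        p = predicted[start:start + batch_size]
        a = actual[start:start + batch_size]
        # Stand-in for sess.run(accuracy, ...): the mean accuracy of this batch.
        batch_acc = sum(1 for u, v in zip(p, a) if u == v) / len(a)
        # Turn the mean back into a count of correct samples before accumulating.
        correct += batch_acc * len(a)
        total += len(a)
    return correct / total


if __name__ == "__main__":
    predicted = [0, 1, 2, 3, 4, 5, 6]
    actual = [0, 1, 2, 0, 0, 5, 6]
    # 5 of the 7 labels match; the last batch holds a single sample.
    print(batched_accuracy(predicted, actual, batch_size=3))
```

When every batch has the same size, simply averaging the per-batch accuracies gives the same result. The weighting matters as soon as the last batch is smaller than the others.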

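I just wanted to share the pits I fell into, so that beginners looking for help later do not have to search for who knows how long, as I did, without finding anything that matches their own problem. I hope your learning goes smoothly!!!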


Listening Rift
