Learning Deep Learning & TensorFlow Through an Image Restoration Example, Part 7: Training the Network

Training the Network

1. The purpose of training

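The purpose of training is to keep adjusting the connection weights W and biases b between the neurons of the network, so that for all training inputs x the difference between the output y_conv of our convolutional neural network and the ideal output y_label gradually shrinks. In the previous post we built the convolutional neural network and had it produce output right away. At that point W and b were randomly initialized, so the output relied on no information from the training set and could not be expected to be any good. Sure enough, the results were terrible. As an analogy, a baby who has had no training at all will not give you a satisfying result no matter what you ask him to do, because he knows nothing yet (apart from crying for milk when he is hungry). The baby slowly grows up and takes in all kinds of information. After a year he has learned to say mom and dad and to walk. The more his parents expose him to, the more he can do, and with a good education he might even get into a good school such as SEU, USTC or SIOM! You are the network's parents. You don't want your little CNN to turn into an artificial idiot, do you? He has plenty of potential: raise him with care and he will not let you down, and once he is well educated he will be able to look after you in your old age.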

2. How to train

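See chapters 2 and 10 of Andrew Ng's machine learning open course: https://study.163.com/course/courseMain.htm?courseId=1004570029
Make sure you roughly know:
(1) What is the cost function J, and what determines it?
Put simply, J measures how far the output y_conv of our convolutional neural network is from the ideal output y_label over all training inputs x. The x and y_label in the training set are fixed, so once the network has been built, y_conv depends only on the weights W and biases b. The cost function J is therefore a function of W and b, and training means changing W and b to minimize J.
(2) What is gradient descent?
The parameters (W and b) are randomly initialized, so gradient descent may start from any point on the cost function surface. We compute the gradient (slope) at that point; its negative is the direction in which the cost function decreases fastest. We then move the parameters in that direction, with a step size determined jointly by the learning rate α and the magnitude of the gradient. From the new point we compute the gradient again, find the direction, and move to obtain new weights, repeating this n times (n is the number of training steps, i.e. the parameters are moved n times toward a smaller cost).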
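The gradient descent loop described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it uses a toy one-variable linear model instead of the CNN from this series, and the names `cost`, `gradient_descent`, `alpha`, `n` and the synthetic data are made up for the example.

```python
import numpy as np

def cost(w, b, x, y_label):
    """Mean squared error J(w, b) between the model output and the labels."""
    y_out = w * x + b
    return np.mean((y_out - y_label) ** 2)

def gradient_descent(x, y_label, alpha=0.1, n=500, seed=0):
    """Run n gradient descent steps from a random starting point."""
    rng = np.random.default_rng(seed)
    w, b = rng.normal(size=2)             # random initialization of the parameters
    history = [cost(w, b, x, y_label)]
    for _ in range(n):
        err = w * x + b - y_label
        grad_w = 2.0 * np.mean(err * x)   # dJ/dw
        grad_b = 2.0 * np.mean(err)       # dJ/db
        w -= alpha * grad_w               # step against the gradient,
        b -= alpha * grad_b               # scaled by the learning rate alpha
        history.append(cost(w, b, x, y_label))
    return w, b, history

# Hypothetical data generated from y = 3x + 0.5 (an assumption for the demo).
x = np.linspace(-1.0, 1.0, 50)
y_label = 3.0 * x + 0.5
w, b, history = gradient_descent(x, y_label)
print(w, b, history[0], history[-1])
```

On this toy problem the cost surface is a simple bowl, so the loop recovers the slope and intercept used to generate the data. The CNN's cost surface is far more complicated, which is why the choice of optimizer matters later on.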

3. Training the network in TensorFlow

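Defining the training process in the computation graph takes only these two lines:

loss = tf.reduce_mean(tf.square(y_conv - y_label))     # define the cost function as the mean squared error
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss) # optimize the parameters with Adam, an adaptive variant of gradient descent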

Then run the following line repeatedly in a session:

# feed a mini-batch to train_op to train the network
sess.run(train_op,feed_dict={x:train_images_batch,y_label:train_labels_batch})

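Here is why we use tf.train.AdamOptimizer rather than tf.train.GradientDescentOptimizer:
A deep learning model is a complex nonlinear structure, and training it is generally a non-convex problem. This means there are many local optima and saddle points, and plain gradient descent may get stuck at one of them. So on this problem we are destined to become "senior parameter tuners". An important parameter of gradient descent is the learning rate, and choosing it well matters: if it is too small, convergence is slow; if it is too large, training oscillates and may even diverge. An ideal gradient descent algorithm would converge quickly and converge to the global optimum. In pursuit of this ideal, many variants of classic gradient descent have appeared. AdamOptimizer is one of the most widely used of them and is usually a good default, so we simply use it.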
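To make the difference concrete, here is a minimal NumPy sketch of a single Adam update, following the update rule from the Adam paper with its usual default hyperparameters. It is not TensorFlow's internal implementation; the function name `adam_step` and the example gradient are assumptions made for illustration.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1.0 - beta1) * grad        # running mean of the gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # running mean of the squared gradients
    m_hat = m / (1.0 - beta1 ** t)              # bias correction for the zero init
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Hypothetical gradient whose two components differ by six orders of magnitude.
theta = np.array([1.0, 1.0])
grad = np.array([1000.0, 0.001])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

new_theta, m, v = adam_step(theta, grad, m, v, t=1)
adam_move = theta - new_theta   # per-parameter step taken by Adam
sgd_move = 1e-4 * grad          # step plain gradient descent would take
print(adam_move, sgd_move)
```

With plain gradient descent the two parameters would move by amounts that differ by a factor of a million, so a single learning rate cannot suit both. Adam rescales each parameter's step by its own gradient history, so both parameters move by roughly alpha on the first step, which makes the learning rate much easier to choose.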

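We already know that computing the cost function requires the x and y_label in the training set. But the training set (55000 images of 28x28 pixels) is large, and processing all of it for every update would make training very slow. So we take a mini-batch to approximate the cost function and adjust the parameters so that the network does well on each mini-batch. If it does, we take it that the network also does well on the whole training set.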
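The mini-batch approximation can be checked numerically. The sketch below assumes made-up data, with one random scalar per sample standing in for the network output and the label; only the sizes 55000 and 10 come from this post. It compares the cost over the whole set with the cost over mini-batches.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, batch_size = 55000, 10   # sizes taken from the text

# Hypothetical per-sample outputs and labels (one scalar per sample for brevity).
y_out = rng.normal(size=num_samples)
y_label = rng.normal(size=num_samples)

sq_err = (y_out - y_label) ** 2
full_loss = sq_err.mean()             # cost over the whole training set

# Shuffle the samples, cut them into mini-batches, and compute each batch's cost.
perm = rng.permutation(num_samples)
batch_losses = sq_err[perm].reshape(-1, batch_size).mean(axis=1)

print(full_loss, batch_losses[0], batch_losses.mean())
```

A single mini-batch gives a noisy estimate of the full cost, but the estimates average out to the full cost over one pass through the data. That is why each update can be computed from only 10 samples and still, on average, push the parameters in a useful direction.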
Below is our complete code. It trains the CNN built in the previous post on the training set and then looks at how it performs on the test set.

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import os
import time

time_start=time.time() # time.time() returns the number of seconds since 1970-01-01 (the epoch)
train_image_path = 'E:\\MNIST_data\\train_images\\'                # path to the training input images
train_label_path = 'E:\\MNIST_data\\train_labels\\'                # path to the training label (target) images
Train_TFRecord_path = 'E:\\MNIST_data\\tfrecord\\train_data_set.tfrecord'# path of the training TFRecord file to write

test_image_path = 'E:\\MNIST_data\\test_images\\'                # path to the test input images
test_label_path = 'E:\\MNIST_data\\test_labels\\'                # path to the test label (target) images
Test_TFRecord_path = 'E:\\MNIST_data\\tfrecord\\test_data_set.tfrecord'# path of the test TFRecord file to write

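img_W = 28      # image width
img_H = 28      # image height
batch_size = 10 # number of samples per mini-batch
min_after_dequeue = 1000 # minimum number of samples left in the queue after a dequeue
capacity = min_after_dequeue + 3*batch_size # maximum number of samples in the queue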

def _bytes_feature(value): # build a bytes (string) feature, used to store the image pixels; choose the features to store according to your own problem
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# pack the images in image_path and label_path pairwise into the file at TFRecord_path
def generate_TFRecordfile(image_path,label_path,TFRecord_path):
    images = []
    labels = []
    for file in sorted(os.listdir(image_path)): # sort so both lists are in the same deterministic order
        images.append(image_path+file) # file names of all the transposed images
    for file in sorted(os.listdir(label_path)):
        labels.append(label_path+file) # file names of all the un-transposed images
    num_examples = len(images)         # count the images
    print('There are %d images\n'%(num_examples))

    writer = tf.python_io.TFRecordWriter(TFRecord_path) # create a writer for the TFRecord file
    for index in range(num_examples):
        image = Image.open(images[index]) # open an image
        image = image.tobytes()           # convert to raw bytes (the feature defined above is a bytes feature)
        label = Image.open(labels[index]) # open the matching label
        label = label.tobytes()           # convert to raw bytes
        # convert one sample to the Example Protocol Buffer format, packing the image and its label together
        example = tf.train.Example(features=tf.train.Features(feature={
            'image':_bytes_feature(image),
            'label':_bytes_feature(label)}))
        writer.write(example.SerializeToString())# write this example to the TFRecord file
    print('TFRecord file was generated successfully\n')
    writer.close()

def get_batch(TFRecord_path):
    reader = tf.TFRecordReader() # create a reader for the examples in the TFRecord file
    files = tf.train.match_filenames_once(TFRecord_path) # get the list of files
    # create the filename queue, shuffled; num_epochs=None cycles through the files without limit
    filename_queue = tf.train.string_input_producer(files,shuffle = True,num_epochs = None)

    # read and parse one example
    _,example = reader.read(filename_queue)
    features = tf.parse_single_example(
        example,
        features={
            'image':tf.FixedLenFeature([],tf.string),
            'label':tf.FixedLenFeature([],tf.string)})

    # use tf.decode_raw to decode the byte strings into the pixel arrays of the images
    images = tf.decode_raw(features['image'],tf.uint8)
    labels = tf.decode_raw(features['label'],tf.uint8)

    # the decoded pixel arrays have shape (img_W*img_H,), so reshape them
    images = tf.reshape(images, shape=[img_W,img_H])
    labels = tf.reshape(labels, shape=[img_W,img_H])

    # add image preprocessing here (optional)

    # use tf.train.shuffle_batch to randomly combine samples into mini-batches for stochastic gradient descent
    Image_Batch,Label_Batch = tf.train.shuffle_batch([images,labels],
                                             batch_size = batch_size,
                                             num_threads = 5,
                                             min_after_dequeue = min_after_dequeue,
                                             capacity = capacity)
    return Image_Batch,Label_Batch

# helper that creates a weight variable
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1) # random values from a truncated normal distribution, kept within [μ-2σ, μ+2σ]
    return tf.Variable(initial)

# helper that creates a bias variable
def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

# helper that creates a convolutional layer
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

# helper that creates a pooling layer
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                            strides=[1, 2, 2, 1], padding='SAME')

print('please wait for generating the TFRecord file of training sets...')
generate_TFRecordfile(train_image_path,train_label_path,Train_TFRecord_path) # generate the TFRecord file of the training set
print('please wait for generating the TFRecord file of test sets...')
generate_TFRecordfile(test_image_path,test_label_path,Test_TFRecord_path)    # generate the TFRecord file of the test set

Train_Images_Batch,Train_Labels_Batch = get_batch(Train_TFRecord_path)   # read the training TFRecord file with multiple threads to produce mini-batches
Test_Images_Batch,Test_Labels_Batch = get_batch(Test_TFRecord_path)      # read the test TFRecord file with multiple threads to produce mini-batches

# placeholders for feeding mini-batches into the network
x = tf.placeholder(tf.float32, shape=[None,img_W,img_H,1],name = 'images')
y_label = tf.placeholder(tf.float32, shape=[None,img_W,img_H,1],name = 'labels')
# first convolutional layer
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
h_conv1 = tf.nn.relu(conv2d(x, W_conv1) + b_conv1)
# first pooling layer
h_pool1 = max_pool_2x2(h_conv1)
# second convolutional layer
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
# second pooling layer
h_pool2 = max_pool_2x2(h_conv2)
# upsampling layer 1 (reuses the weights of the second convolutional layer)
W_de_conv1 = W_conv2
h_de_conv1 = tf.nn.conv2d_transpose(h_pool2,W_de_conv1,output_shape=[batch_size, 14, 14, 32],strides=[1,2,2,1],padding="SAME")
# upsampling layer 2 (reuses the weights of the first convolutional layer)
W_de_conv2 = W_conv1
h_de_conv2 = tf.nn.conv2d_transpose(h_de_conv1,W_de_conv2,output_shape=[batch_size, 28, 28, 1],strides=[1,2,2,1],padding="SAME")
# output of the network
y_conv = h_de_conv2

loss = tf.reduce_mean(tf.square(y_conv - y_label))     # define the cost function as the mean squared error
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss) # optimize the parameters with the Adam optimizer

init_op = (tf.local_variables_initializer(),tf.global_variables_initializer()) # initialization ops
with tf.Session() as sess:
    sess.run(init_op)
    coord = tf.train.Coordinator() # coordinates shutting down multiple threads together
    threads = tf.train.start_queue_runners(sess=sess,coord=coord) # start the queue-runner threads
    try:
        for step in range(100000):# train for 100,000 steps
            if coord.should_stop(): # coord.should_stop() becomes True once a stop has been requested; leave the loop
                break
            train_images_batch,train_labels_batch = sess.run([Train_Images_Batch,Train_Labels_Batch])

            train_images_batch = np.reshape(train_images_batch,[batch_size,img_W,img_H,1]) # add the channel dimension
            train_labels_batch = np.reshape(train_labels_batch,[batch_size,img_W,img_H,1])

            sess.run(train_op,feed_dict={x:train_images_batch,y_label:train_labels_batch})  # feed the mini-batch to train_op to train the network

            if step%100 == 0: # every 100 steps, print the loss on a training batch and on a test batch
                test_images_batch,test_labels_batch = sess.run([Test_Images_Batch,Test_Labels_Batch])
                test_images_batch = np.reshape(test_images_batch,[batch_size,img_W,img_H,1]) # add the channel dimension
                test_labels_batch = np.reshape(test_labels_batch,[batch_size,img_W,img_H,1])

                train_loss = sess.run(loss,feed_dict={x:train_images_batch,y_label:train_labels_batch})
                test_loss = sess.run(loss,feed_dict={x:test_images_batch,y_label:test_labels_batch})
                # the loss is a float, so format it with %f (%d would truncate it to an integer)
                print('step %d: loss on training set batch:%f  loss on testing set batch:%f' % (step,train_loss,test_loss))

        test_images_batch,test_labels_batch = sess.run([Test_Images_Batch,Test_Labels_Batch])
        test_images_batch = np.reshape(test_images_batch,[batch_size,img_W,img_H,1]) # add the channel dimension
        test_labels_batch = np.reshape(test_labels_batch,[batch_size,img_W,img_H,1])

        y_pred = sess.run(y_conv,feed_dict={x:test_images_batch})
        # plot the results
        input_img = test_images_batch[0,:,:,:] # take one image out of the mini-batch (10 images) to look at
        output_img = y_pred[0,:,:,:]
        output_img = np.clip(output_img,0,255) # clip to the valid uint8 range so the cast below cannot wrap around
        label = test_labels_batch[0,:,:,:]
        input_img = np.reshape(input_img,[28,28])
        output_img = np.reshape(output_img,[28,28])
        label = np.reshape(label,[28,28])
        input_img = Image.fromarray(input_img.astype('uint8')).convert('L')
        output_img = Image.fromarray(output_img.astype('uint8')).convert('L')
        label = Image.fromarray(label.astype('uint8')).convert('L')
        plt.imshow(input_img)
        plt.show()
        plt.imshow(output_img)
        plt.show()
        plt.imshow(label)
        plt.show()
    except tf.errors.OutOfRangeError: # catch the end-of-input marker of the filename queue
        print('epoch limit reached')
        coord.request_stop() # tell the other threads to stop reading data
    finally:
        coord.request_stop()
        coord.join(threads) # wait for all threads to exit
time_end=time.time() # time.time() returns the number of seconds since 1970-01-01 (the epoch)
print('Total run time is : %f s' %(time_end-time_start))

Output (figure):
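Run time: 2699.547610 s, just long enough to watch a replay of the dreadful first half of the China vs. Saudi Arabia match at the Asian Cup.
Interesting, isn't it? You can just about tell that the network output is the input image transposed. For lack of time we only ran 100,000 steps. Training for more steps would lower the cost function further and give more accurate results, and later posts will introduce some regularization methods to improve the network's performance further.
Our network seems a bit slow-witted and learns slowly, so in the next post we will get it a toy (a GPU) to help it learn faster. I have just rerun the program with a single GPU, and the run time dropped to 559.236460 s, roughly 4.8 times faster.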