TensorFlow Implementation of AlexNet Training and Evaluation on ImageNet

        In my previous post I described how to preprocess the data for the ImageNet 2012 image-classification challenge. Building on that, we can construct different neural networks and learn how to train on and predict the ImageNet data. The ImageNet dataset is large, with more than 1.2 million images across 1000 classes, far beyond what CIFAR-10 offers; in my view, only by training a good model on ImageNet can one demonstrate the ability to solve real problems in computer vision.

       Here I start by looking at how to build an AlexNet model with TensorFlow. The official TensorFlow model repository contains an AlexNet model, but it is incomplete: it has only the convolutional layers, no fully connected layers, and it does not cover actual training or evaluation; it is only used to benchmark AlexNet's forward and backward passes. I also searched online, and most TensorFlow implementations of AlexNet are essentially copies of that official code, without a complete model-building and training pipeline. Only a few implement the full AlexNet, but they use rather old input pipelines instead of the Dataset API and are quite slow. So I decided to write my own code from scratch to test my grasp of deep learning.

        AlexNet is a deep convolutional network proposed in 2012 by Alex Krizhevsky, a student of Hinton. It won the 2012 ImageNet image-recognition challenge by a large margin over the runner-up, which truly kicked off the wave of deep-learning research. Its network architecture is shown in the figure below:

(Figure: the AlexNet network architecture)

     When the model was proposed, GPU memory was limited, so it had to be split into two parts running in parallel on two GPUs. My GPU has 8 GB of memory, so in my implementation I run the whole model on a single GPU.

     The model architecture is explained below:

  1. The input image is 224*224*3, and the first convolution layer uses an 11*11 kernel with a stride of 4. Note that in TensorFlow, when the convolution's Padding parameter is set to SAME, the output size is 224/4=56. To get the output size of 55 mentioned in the original paper, the input would need to be 227*227*3 with the Padding parameter set to VALID, giving an output size of (227-11)/4+1=55. Either choice makes no difference to the final result, so I chose 224*224*3 as the input size (see the output-size sketch after this list). The convolution has 96 output channels.
  2. After the first convolution layer, RELU is used as the activation function, followed by a Local Response Normalization layer. Modeled on a biological principle (lateral inhibition), it normalizes each activation using the activations of several neighboring channels, and it can be implemented with TensorFlow's LRN op. LRN is quite slow to compute; according to the paper it improves the final accuracy by about 1%, so in actual training I commented out the LRN code. The result is then max pooled, which works better than the average pooling that was commonly used before. The max-pool Kernel Size is 3*3, the Stride is 2*2, and the Padding is VALID; since the stride is slightly smaller than the kernel size, the windows overlap, which gives better results. The pooled output size is (56-3)/2+1=27.
  3. The second convolution layer takes a 27*27*96 input, with a 5*5 Kernel, Stride 1, SAME Padding and 256 channels, so the output is 27*27*256. RELU is the activation function, followed by a Local Response Normalization layer and another Max Pool layer with the same parameters as the previous one, giving a pooled output of (27-3)/2+1=13.
  4. The third convolution layer takes a 13*13*256 input, with a 3*3 Kernel, Stride 1, SAME Padding and 384 channels, so the output is 13*13*384, with RELU as the activation function.
  5. The fourth convolution layer takes a 13*13*384 input, with a 3*3 Kernel, Stride 1, SAME Padding and 384 channels, so the output is 13*13*384, with RELU as the activation function.
  6. The fifth convolution layer takes a 13*13*384 input, with a 3*3 Kernel, Stride 1, SAME Padding and 256 channels, so the output is 13*13*256, with RELU as the activation function. It is followed by a Max Pool layer with the same parameters as the previous one, giving a pooled output of (13-3)/2+1=6.
  7. After the fifth convolution layer and its pooling, the output is 6*6*256. The data is flattened and fed into two fully connected layers, each with 4096 output nodes. Dropout can be applied to these two layers during training. Both use RELU activations.
  8. The last layer is a fully connected layer with 1000 output nodes; applying SOFTMAX to its output gives the probability of each class.
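
To double-check the sizes quoted in the list above, here is a minimal sketch of the output-size arithmetic TensorFlow uses for SAME and VALID padding (the helper function is my own, purely for illustration):

import math

def conv_output_size(input_size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer under TensorFlow's rules."""
    if padding == 'SAME':
        return math.ceil(input_size / stride)
    # VALID padding
    return (input_size - kernel) // stride + 1

print(conv_output_size(224, 11, 4, 'SAME'))   # conv1 with SAME padding: 56
print(conv_output_size(227, 11, 4, 'VALID'))  # conv1 alternative with VALID padding: 55
print(conv_output_size(56, 3, 2, 'VALID'))    # pool after conv1: 27
print(conv_output_size(27, 3, 2, 'VALID'))    # pool after conv2: 13
print(conv_output_size(13, 3, 2, 'VALID'))    # pool after conv5: 6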

   The code for the AlexNet model is as follows:

import tensorflow as tf

def inference(images, dropout_rate=1.0, wd=None):
    with tf.variable_scope('conv1', reuse=tf.AUTO_REUSE):
        kernel = tf.get_variable(initializer=tf.truncated_normal([11,11,3,96], dtype=tf.float32, stddev=1e-1), trainable=True, name='weights')
        conv = tf.nn.conv2d(images, kernel, [1,4,4,1], padding='SAME')
        biases = tf.get_variable(initializer=tf.constant(0.1, shape=[96], dtype=tf.float32), trainable=True, name='biases')
        bias = tf.nn.bias_add(conv, biases)
        conv1 = tf.nn.relu(bias, name='conv1')
    lrn1 = tf.nn.lrn(conv1, 4, bias=1.0, alpha=0.001/9, beta=0.75, name='lrn1')
    pool1 = tf.nn.max_pool(lrn1, ksize=[1,3,3,1], strides=[1,2,2,1], padding='VALID', name='pool1')
    
    with tf.variable_scope('conv2', reuse=tf.AUTO_REUSE):
        kernel = tf.get_variable(initializer=tf.truncated_normal([5,5,96,256], dtype=tf.float32, stddev=1e-1), trainable=True, name='weights')
        conv = tf.nn.conv2d(pool1, kernel, [1,1,1,1], padding='SAME')
        biases = tf.get_variable(initializer=tf.constant(0.1, shape=[256], dtype=tf.float32), trainable=True, name='biases')
        bias = tf.nn.bias_add(conv, biases)
        conv2 = tf.nn.relu(bias, name='conv2')
    lrn2 = tf.nn.lrn(conv2, 4, bias=1.0, alpha=0.001/9, beta=0.75, name='lrn2')
    pool2 = tf.nn.max_pool(lrn2, ksize=[1,3,3,1], strides=[1,2,2,1], padding='VALID', name='pool2')

    with tf.variable_scope('conv3', reuse=tf.AUTO_REUSE):
        kernel = tf.get_variable(initializer=tf.truncated_normal([3,3,256,384], dtype=tf.float32, stddev=1e-1), trainable=True, name='weights')
        conv = tf.nn.conv2d(pool2, kernel, [1,1,1,1], padding='SAME')
        biases = tf.get_variable(initializer=tf.constant(0.1, shape=[384], dtype=tf.float32), trainable=True, name='biases')
        bias = tf.nn.bias_add(conv, biases)
        conv3 = tf.nn.relu(bias, name='conv3')

    with tf.variable_scope('conv4', reuse=tf.AUTO_REUSE):
        kernel = tf.get_variable(initializer=tf.truncated_normal([3,3,384,384], dtype=tf.float32, stddev=1e-1), trainable=True, name='weights')
        conv = tf.nn.conv2d(conv3, kernel, [1,1,1,1], padding='SAME')
        biases = tf.get_variable(initializer=tf.constant(0.1, shape=[384], dtype=tf.float32), trainable=True, name='biases')
        bias = tf.nn.bias_add(conv, biases)
        conv4 = tf.nn.relu(bias, name='conv4')

    with tf.variable_scope('conv5', reuse=tf.AUTO_REUSE):
        kernel = tf.get_variable(initializer=tf.truncated_normal([3,3,384,256], dtype=tf.float32, stddev=1e-1), trainable=True, name='weights')
        conv = tf.nn.conv2d(conv4, kernel, [1,1,1,1], padding='SAME')
        biases = tf.get_variable(initializer=tf.constant(0.1, shape=[256], dtype=tf.float32), trainable=True, name='biases')
        bias = tf.nn.bias_add(conv, biases)
        conv5 = tf.nn.relu(bias, name='conv5')
    pool5 = tf.nn.max_pool(conv5, ksize=[1,3,3,1], strides=[1,2,2,1], padding='VALID', name='pool5')

    flatten = tf.layers.flatten(inputs=pool5, name='flatten')

    with tf.variable_scope('local1', reuse=tf.AUTO_REUSE):
        weights = tf.get_variable(initializer=tf.truncated_normal([6*6*256,4096], dtype=tf.float32, stddev=1/4096.0), trainable=True, name='weights')
        if wd is not None:
            weights_loss = tf.multiply(tf.nn.l2_loss(weights), wd, name='weight_loss')
            tf.add_to_collection('losses', weights_loss)
        biases = tf.get_variable(initializer=tf.constant(1.0, shape=[4096], dtype=tf.float32), trainable=True, name='biases')
        local1 = tf.nn.relu(tf.nn.xw_plus_b(flatten, weights, biases), name='local1')
        local1 = tf.nn.dropout(local1, dropout_rate)

    with tf.variable_scope('local2', reuse=tf.AUTO_REUSE):
        weights = tf.get_variable(initializer=tf.truncated_normal([4096,4096], dtype=tf.float32, stddev=1/4096.0), trainable=True, name='weights')
        if wd is not None:
            weights_loss = tf.multiply(tf.nn.l2_loss(weights), wd, name='weight_loss')
            tf.add_to_collection('losses', weights_loss)
        biases = tf.get_variable(initializer=tf.constant(1.0, shape=[4096], dtype=tf.float32), trainable=True, name='biases')
        local2 = tf.nn.relu(tf.nn.xw_plus_b(local1, weights, biases), name='local2')
        local2 = tf.nn.dropout(local2, dropout_rate)

    with tf.variable_scope('local3', reuse=tf.AUTO_REUSE):
        weights = tf.get_variable(initializer=tf.truncated_normal([4096,1000], dtype=tf.float32, stddev=1e-3), trainable=True, name='weights')
        biases = tf.get_variable(initializer=tf.constant(1.0, shape=[1000], dtype=tf.float32), trainable=True, name='biases')
        local3 = tf.nn.xw_plus_b(local2, weights, biases, name='local3')

    return local3
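
As a quick sanity check of the inference graph, run separately from the training script below, one can feed a dummy batch through the model and verify the shape of the logits (a minimal sketch; the batch size of 8 and the random input are arbitrary):

import numpy as np
import tensorflow as tf
import alexnet_model

# Build the graph on a placeholder with the expected input shape
images = tf.placeholder(tf.float32, [None, 224, 224, 3])
logits = alexnet_model.inference(images, dropout_rate=1.0, wd=None)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    dummy = np.random.rand(8, 224, 224, 3).astype(np.float32)
    print(sess.run(logits, feed_dict={images: dummy}).shape)  # expected: (8, 1000)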

      The code for processing the ImageNet data can be found in my previous post.

      The following code reads the ImageNet data and performs training and validation.

import tensorflow as tf
import os
import time
import alexnet_model

imageWidth = 224
imageHeight = 224
imageDepth = 3
batch_size = 128
resize_min = 256

# Parse the TFRecord and distort the image for training
def _parse_function(example_proto):
    features = {"image": tf.FixedLenFeature([], tf.string, default_value=""),
                "height": tf.FixedLenFeature([1], tf.int64, default_value=[0]),
                "width": tf.FixedLenFeature([1], tf.int64, default_value=[0]),
                "channels": tf.FixedLenFeature([1], tf.int64, default_value=[3]),
                "colorspace": tf.FixedLenFeature([], tf.string, default_value=""),
                "img_format": tf.FixedLenFeature([], tf.string, default_value=""),
                "label": tf.FixedLenFeature([1], tf.int64, default_value=[0]),
                "bbox_xmin": tf.VarLenFeature(tf.float32),
                "bbox_xmax": tf.VarLenFeature(tf.float32),
                "bbox_ymin": tf.VarLenFeature(tf.float32),
                "bbox_ymax": tf.VarLenFeature(tf.float32),
                "text": tf.FixedLenFeature([], tf.string, default_value=""),
                "filename": tf.FixedLenFeature([], tf.string, default_value="")
               }
    parsed_features = tf.parse_single_example(example_proto, features)
    
    xmin = tf.expand_dims(parsed_features["bbox_xmin"].values, 0)
    xmax = tf.expand_dims(parsed_features["bbox_xmax"].values, 0)
    ymin = tf.expand_dims(parsed_features["bbox_ymin"].values, 0)
    ymax = tf.expand_dims(parsed_features["bbox_ymax"].values, 0)
    
    bbox = tf.concat(axis=0, values=[ymin, xmin, ymax, xmax])
    bbox = tf.expand_dims(bbox, 0)
    bbox = tf.transpose(bbox, [0, 2, 1])
    
    height = parsed_features["height"]
    width = parsed_features["width"]
    channels = parsed_features["channels"]
 
    bbox_begin, bbox_size, bbox_for_draw = tf.image.sample_distorted_bounding_box(
        tf.concat(axis=0, values=[height, width, channels]),
        bounding_boxes=bbox,
        min_object_covered=0.1,
        use_image_if_no_bounding_boxes=True)

    # Reassemble the bounding box in the format the crop op requires.
    offset_y, offset_x, _ = tf.unstack(bbox_begin)
    target_height, target_width, _ = tf.unstack(bbox_size)
    crop_window = tf.cast(tf.stack([offset_y, offset_x, target_height, target_width]), tf.int32)
    
    # Use the fused decode and crop op here, which is faster than each in series.
    cropped = tf.image.decode_and_crop_jpeg(parsed_features["image"], crop_window, channels=3)

    # Flip to add a little more random distortion in.
    cropped = tf.image.random_flip_left_right(cropped)
    
    image_train = tf.image.resize_images(cropped, [imageHeight, imageWidth], 
                                         method=tf.image.ResizeMethod.BILINEAR,align_corners=False)
    
    image_train = tf.cast(image_train, tf.uint8)
    image_train = tf.image.convert_image_dtype(image_train, tf.float32)
    return image_train, parsed_features["label"][0], parsed_features["text"], parsed_features["filename"]

with tf.device('/cpu:0'):
    train_files_names = os.listdir('train_tf/')
    train_files = ['/home/roy/AI/train_tf/'+item for item in train_files_names]
    dataset_train = tf.data.TFRecordDataset(train_files)
    dataset_train = dataset_train.map(_parse_function, num_parallel_calls=4)
    dataset_train = dataset_train.repeat(10)
    dataset_train = dataset_train.batch(batch_size)
    dataset_train = dataset_train.prefetch(batch_size)
    iterator = tf.data.Iterator.from_structure(dataset_train.output_types, dataset_train.output_shapes)
    next_images, next_labels, next_text, next_filenames = iterator.get_next()
    train_init_op = iterator.make_initializer(dataset_train)

def _parse_test_function(example_proto):
    features = {"image": tf.FixedLenFeature([], tf.string, default_value=""),
                "height": tf.FixedLenFeature([1], tf.int64, default_value=[0]),
                "width": tf.FixedLenFeature([1], tf.int64, default_value=[0]),
                "channels": tf.FixedLenFeature([1], tf.int64, default_value=[3]),
                "colorspace": tf.FixedLenFeature([], tf.string, default_value=""),
                "img_format": tf.FixedLenFeature([], tf.string, default_value=""),
                "label": tf.FixedLenFeature([1], tf.int64, default_value=[0]),
                "bbox_xmin": tf.VarLenFeature(tf.float32),
                "bbox_xmax": tf.VarLenFeature(tf.float32),
                "bbox_ymin": tf.VarLenFeature(tf.float32),
                "bbox_ymax": tf.VarLenFeature(tf.float32),
                "text": tf.FixedLenFeature([], tf.string, default_value=""),
                "filename": tf.FixedLenFeature([], tf.string, default_value="")
               }
    parsed_features = tf.parse_single_example(example_proto, features)
    image_decoded = tf.image.decode_jpeg(parsed_features["image"], channels=3)
    shape = tf.shape(image_decoded)
    height, width = shape[0], shape[1]
    resized_height, resized_width = tf.cond(height<width,
        lambda: (resize_min, tf.cast(tf.multiply(tf.cast(width, tf.float64),tf.divide(resize_min,height)), tf.int32)),
        lambda: (tf.cast(tf.multiply(tf.cast(height, tf.float64),tf.divide(resize_min,width)), tf.int32), resize_min))
    image_resized = tf.image.resize_images(image_decoded, [resized_height, resized_width])
    image_resized = tf.cast(image_resized, tf.uint8)
    image_resized = tf.image.convert_image_dtype(image_resized, tf.float32)
    
    # Calculate the offsets for the center crop
    shape = tf.shape(image_resized)  
    height, width = shape[0], shape[1]
    amount_to_be_cropped_h = (height - imageHeight)
    crop_top = amount_to_be_cropped_h // 2
    amount_to_be_cropped_w = (width - imageWidth)
    crop_left = amount_to_be_cropped_w // 2
    image_valid = tf.slice(image_resized, [crop_top, crop_left, 0], [imageHeight, imageWidth, -1])
    return image_valid, parsed_features["label"][0], parsed_features["text"], parsed_features["filename"]

with tf.device('/cpu:0'):
    valid_files_names = os.listdir('valid_tf/')
    valid_files = ['/home/roy/AI/valid_tf/'+item for item in valid_files_names]
    dataset_valid = tf.data.TFRecordDataset(valid_files)
    dataset_valid = dataset_valid.map(_parse_test_function, num_parallel_calls=4)
    dataset_valid = dataset_valid.batch(batch_size)
    dataset_valid = dataset_valid.prefetch(batch_size)
    iterator_valid = tf.data.Iterator.from_structure(dataset_valid.output_types, dataset_valid.output_shapes)
    next_valid_images, next_valid_labels, next_valid_text, next_valid_filenames = iterator_valid.get_next()
    valid_init_op = iterator_valid.make_initializer(dataset_valid)

global_step = tf.Variable(0, trainable=False)
epoch_steps = int(1281167/batch_size)
boundaries = [epoch_steps*30, epoch_steps*40]
values = [0.01, 0.001, 0.0001]
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)
lr_summary = tf.summary.scalar('learning_rate', learning_rate)

result = alexnet_model.inference(next_images, dropout_rate=0.5, wd=0.0005)
output_result_scores = tf.nn.softmax(result)
output_result = tf.argmax(output_result_scores, 1)

#Calculate the cross entropy loss
cross_entropy = tf.losses.sparse_softmax_cross_entropy(labels=next_labels, logits=result)
cross_entropy_mean = tf.reduce_mean(cross_entropy, name='cross_entropy')
tf.add_to_collection('losses', cross_entropy_mean)
 
#Add the l2 weights to the loss
loss = tf.add_n(tf.get_collection('losses'), name='total_loss')
loss_summary = tf.summary.scalar('loss', loss)
 
#Define the optimizer
opt_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)

#Define the Exp moving average
ema = tf.train.ExponentialMovingAverage(decay=0.9999)
with tf.control_dependencies([opt_op]):
    optimize_op = ema.apply(tf.trainable_variables())

#Get the inference logits by the model for the validation images
result_valid = alexnet_model.inference(next_valid_images, dropout_rate=1.0, wd=None)
output_valid_scores = tf.nn.softmax(result_valid)
output_valid_result = tf.argmax(output_valid_scores, 1)
accuracy_valid_batch = tf.reduce_mean(tf.cast(tf.equal(next_valid_labels, tf.argmax(output_valid_scores, 1)), tf.float32))
accuracy_valid_top_5 = tf.reduce_mean(tf.cast(tf.nn.in_top_k(output_valid_scores, next_valid_labels, k=5), tf.float32))
acc_1_summary = tf.summary.scalar('accuracy_valid_top_1', accuracy_valid_batch)
acc_2_summary = tf.summary.scalar('accuracy_valid_top_5', accuracy_valid_top_5)

# Add ops to save and restore all the variables.
saver = tf.train.Saver()

with tf.Session() as sess:
    #saver.restore(sess, "model/model.ckpt-5000")
    sess.run(tf.global_variables_initializer())
    sess.run([global_step, train_init_op, valid_init_op])
    total_loss = 0.0
    epoch = 0
    starttime = time.time()
    while(True):
        try:
            loss_t, lr, step, _ = sess.run([loss, learning_rate, global_step, optimize_op])
            total_loss += loss_t
            
            if step%100==0:
                print("step: %i, Learning_rate: %f, Time: %is Loss: %f"%(step, lr, int(time.time()-starttime), total_loss/100))
                total_loss = 0.0
                starttime = time.time()
            
            if step%5000==0:
                save_path = saver.save(sess, "model/model.ckpt", global_step=global_step)
                truepredict = 0.0
                truepredict_top5 = 0.0
                valid_count = 0
                while(True):
                    try:
                        acc_valid_1, acc_valid_5, valid_result_t = sess.run([accuracy_valid_batch, accuracy_valid_top_5, output_valid_result])
                        truepredict += acc_valid_1
                        truepredict_top5 += acc_valid_5
                        valid_count += 1
                    except tf.errors.OutOfRangeError:
                        print("valid accuracy of top 1: %f" % (truepredict/valid_count))
                        print("valid accuracy of top 5: %f" % (truepredict_top5/valid_count))
                        break
                starttime = time.time()
                sess.run([valid_init_op])
          
        except tf.errors.OutOfRangeError:
            break

     Notes from debugging the model:

  1. The initialization of the model parameters matters a great deal; an unreasonable initialization can make the loss keep growing until it hits a NaN error. For example, if the fully connected weights are drawn directly from a normal distribution with mean 0 and standard deviation 0.1 or 0.01, the loss will be very large. I spent a long time debugging this, and only after printing the outputs of each layer did I find that every output node of the fully connected layers had a very large value. The reason is that a fully connected layer has many nodes: even though the convolution outputs are small, multiplying by the weights and summing makes each node's value large, especially with three fully connected layers stacked, so a poor initialization makes the final outputs huge, with some nodes reaching the tens of thousands. The cross-entropy then requires a softmax, i.e. e^x, which overflows the numeric range when x is that large. The reasonable fix is to scale the standard deviation by the number of output nodes; for example, for a fully connected layer with 4096 nodes, use a standard deviation of 1/4096 (a small numerical sketch follows after this list). The initialization of the convolution layers matters just as much: the standard deviation should be chosen so that the convolution outputs stay on the same order of magnitude as the inputs. The paper suggests 0.01 for each convolution layer, but I found that with that value the loss was very hard to optimize, while 0.1 worked; this is probably because the value range of my preprocessed images differs from that of the paper.
  2. Besides applying Dropout to the fully connected layers, adding an L2 penalty on the fully connected weights is another way to improve accuracy.
  3. My handling of the input images differs a little from the paper. The paper resizes the input image to 256*256 and then randomly crops a 224*224 patch. In my code, if an image comes with BBOX data, a random crop containing the BBOX is taken and resized to 224*224. I think this works better, because many images contain several objects but carry only one label, namely the object in the BBOX, so the BBOX object should make up the main part of the training image. For the validation set, the paper takes 224*224 crops at five positions of the resized image and averages the predictions; for simplicity I just take the center crop.
  4. During training I evaluate the Top-1 and Top-5 accuracy on the validation set every 5000 batches; if the accuracy stops improving, I manually lower the Learning Rate to 1/10 of the previous rate.
  5. Training on the whole of ImageNet is very time-consuming. On my machine, a 2-core/4-thread i3 with 16G RAM and a GTX1070i with 8G of memory, every 100 batches takes about 20 seconds. Looking at the system load, the four CPU threads stay at roughly 80% while the GPU load is fairly low, so the bottleneck is the image preprocessing; with a CPU with more cores the training time should drop further.
  6. After training for 50 epochs, the Top-5 accuracy is a little over 70% and the Top-1 accuracy a little over 46%. The learning rate was 0.01 for the first 30 epochs, 0.001 for epochs 30-40, and 0.0001 for epochs 40-50. The result is not as good as the paper's, which reports a Top-5 accuracy of about 82% after 90 epochs, but by epoch 50 my accuracy could not improve any further. This is left for further study.
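
To illustrate the initialization point in item 1 above, here is a rough NumPy sketch (with arbitrary random inputs, not part of the training code) of how the magnitude of a fully connected layer's outputs depends on the weight standard deviation:

import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(9216).astype(np.float32)   # stand-in for the flattened 6*6*256 conv output
for stddev in (0.1, 0.01, 1.0 / 4096):
    w = rng.normal(0.0, stddev, size=(9216, 4096)).astype(np.float32)
    # Each output node sums 9216 terms, so its magnitude grows with the stddev;
    # with 1/4096 the outputs stay small and the softmax does not overflow.
    print(stddev, float(np.abs(x @ w).mean()))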

Reposted from blog.csdn.net/gzroy/article/details/87652291