Understanding Convolutional Neural Networks in TensorFlow

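First, a brief review of how a convolutional neural network works:
A convolutional neural network typically uses a stack of convolutional and subsampling layers as its feature extractor. As data passes through these layers, the feature maps shrink in size while their number usually grows. The feature extractor is followed by a classifier, usually a multilayer perceptron. At the end of the feature extractor, all feature maps are unrolled and concatenated into a single vector, called the feature vector, which serves as the input to the classifier.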
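The convolution operation:
Three two-dimensional matrices take part in a convolution: two feature maps and one kernel, namely the input map inputX, the output map outputY, and the convolution kernel kernelW. The kernel kernelW is laid over a local patch of inputX; each weight in kernelW is multiplied by the output of the inputX neuron beneath it, and the products are summed and written to the corresponding position of outputY. The kernel then slides across inputX from left to right and top to bottom, one position at a time, until the whole of inputX has been convolved.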
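To make the sliding-window description concrete, here is a minimal pure-Python sketch, assuming a stride of 1 and no padding. The function name conv2d_valid and the sample values are illustrative only; inputX, outputY, and kernelW from the text appear as input_x, output_y, and kernel_w. Like the TensorFlow operation quoted later in this post, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d_valid(input_x, kernel_w):
    """Slide kernel_w over input_x (stride 1, no padding) and return output_y."""
    in_h, in_w = len(input_x), len(input_x[0])
    k_h, k_w = len(kernel_w), len(kernel_w[0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    output_y = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):            # top to bottom
        for j in range(out_w):        # left to right
            total = 0
            for di in range(k_h):
                for dj in range(k_w):
                    # weight times the input value it currently covers
                    total += kernel_w[di][dj] * input_x[i + di][j + dj]
            output_y[i][j] = total
    return output_y


# Illustrative 4x4 input and 2x2 kernel (values are made up for the example)
input_x = [[1, 2, 3, 0],
           [0, 1, 2, 3],
           [3, 0, 1, 2],
           [2, 3, 0, 1]]
kernel_w = [[1, 0],
            [0, 1]]
output_y = conv2d_valid(input_x, kernel_w)
print(output_y)
```

A 4x4 input and a 2x2 kernel give a 3x3 output, because the kernel fits in three positions along each axis.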
Subsampling comes in two forms: mean subsampling and max subsampling.

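Max subsampling can be viewed as a kernel in which exactly one weight is 1 and all others are 0, with the 1 falling on the strongest input in the window, so that only the strongest value is kept. The kernel slides over the input with a stride of 2, which shrinks the map to 1/4 of its original size. In mean subsampling every weight of the (2 x 2) kernel is 0.25, so the output keeps the average of the inputs in each window.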
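The two subsampling modes can be sketched in a few lines of pure Python, assuming a 2x2 window and a stride of 2 as described above. The helper names pool2x2 and mean_of and the sample feature map are hypothetical.

```python
def pool2x2(feature_map, reduce_fn):
    """Apply reduce_fn to each non-overlapping 2x2 window (stride 2)."""
    pooled = []
    for i in range(0, len(feature_map) - 1, 2):
        row = []
        for j in range(0, len(feature_map[0]) - 1, 2):
            window = [feature_map[i][j], feature_map[i][j + 1],
                      feature_map[i + 1][j], feature_map[i + 1][j + 1]]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def mean_of(window):
    # Equivalent to weighting each of the four inputs by 0.25
    return sum(window) / len(window)


feature_map = [[1, 3, 2, 4],
               [5, 7, 6, 8],
               [4, 2, 0, 1],
               [3, 1, 2, 5]]
max_pooled = pool2x2(feature_map, max)
mean_pooled = pool2x2(feature_map, mean_of)
print(max_pooled, mean_pooled)
```

The 4x4 map has 16 values and each pooled map has 4, which is the reduction to 1/4 mentioned above.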

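In essence, a convolution kernel is the set of weights connecting neurons, and those weights are shared by all neurons belonging to the same feature map. During training, the feature map formed by the input neurons is divided into kernel-sized patches, and each patch is connected through the kernel to one neuron of the next layer's feature map. Every patch of a feature map uses the same kernel for its connection to the next feature map, which means the neurons of one feature map share their connection weights.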
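A rough way to see what weight sharing buys is to count weights. The sketch below assumes a 28x28 input map and a 5x5 kernel with no padding (sizes chosen only for illustration), ignores biases, and compares a shared kernel with a hypothetical layer in which every output neuron has its own weight to every input neuron.

```python
def dense_weight_count(in_h, in_w, out_h, out_w):
    """Weights needed if every output neuron connects to every input neuron."""
    return in_h * in_w * out_h * out_w


def shared_kernel_weight_count(k_h, k_w):
    """Weights needed when one kernel is shared across the whole feature map."""
    return k_h * k_w


in_size = 28                      # assumed input map size
kernel_size = 5                   # assumed kernel size
out_size = in_size - kernel_size + 1   # 24 positions per axis without padding

dense_count = dense_weight_count(in_size, in_size, out_size, out_size)
shared_count = shared_kernel_weight_count(kernel_size, kernel_size)
print(dense_count, shared_count)
```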

Building a Convolutional Neural Network with the TensorFlow API

In TensorFlow, a convolutional neural network (CNN) is built from three main types of components. The main APIs are as follows:

Convolutional layer: builds a 2-D convolutional layer. Commonly used parameters:

conv = tf.layers.conv2d(
        inputs=pool,
        filters=64,
        kernel_size=[5, 5],
        padding="same",
        activation=tf.nn.relu)

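inputs is the input Tensor, filters is the number of convolution kernels, and kernel_size is the size of each kernel. padding selects how the borders are handled and is either valid or same: valid adds no pixels to the input, while same pads the borders of the input. For the exact formulas, see the TensorFlow API documentation.
Note: accessing the site above may require a proxy.
The relevant documentation is reproduced below: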

tf.nn.convolution(
    input,
    filter,
    padding,
    strides=None,
    dilation_rate=None,
    name=None,
    data_format=None
)

Defined in tensorflow/python/ops/nn_ops.py.

See the guide: Neural Network > Convolution

Computes sums of N-D convolutions (actually cross-correlation).

This also supports either output striding via the optional strides parameter or atrous convolution (also known as convolution with holes or dilated convolution, based on the French word “trous” meaning holes in English) via the optional dilation_rate parameter. Currently, however, output striding is not supported for atrous convolutions.

Specifically, in the case that data_format does not start with “NC”, given a rank (N+2) input Tensor of shape

[num_batches, input_spatial_shape[0], …, input_spatial_shape[N-1], num_input_channels],

a rank (N+2) filter Tensor of shape

[spatial_filter_shape[0], …, spatial_filter_shape[N-1], num_input_channels, num_output_channels],

an optional dilation_rate tensor of shape [N] (defaulting to [1]*N) specifying the filter upsampling/input downsampling rate, and an optional list of N strides (defaulting to [1]*N), this computes for each N-D spatial output position (x[0], …, x[N-1]):

output[b, x[0], ..., x[N-1], k] =
      sum_{z[0], ..., z[N-1], q}
          filter[z[0], ..., z[N-1], q, k] *
          padded_input[b,
                       x[0]*strides[0] + dilation_rate[0]*z[0],
                       ...,
                       x[N-1]*strides[N-1] + dilation_rate[N-1]*z[N-1],
                       q]

where b is the index into the batch, k is the output channel number, q is the input channel number, and z is the N-D spatial offset within the filter. Here, padded_input is obtained by zero padding the input using an effective spatial filter shape of (spatial_filter_shape-1) * dilation_rate + 1 and output striding strides as described in the comment here.
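As an illustration of the indexing in the formula above, here is a minimal pure-Python sketch for the simplest case: N = 1, a single batch element, one input channel and one output channel, and no padding. The function name conv1d_valid and the sample data are hypothetical, and the sketch is not TensorFlow's implementation.

```python
def conv1d_valid(inputs, filt, stride=1, dilation_rate=1):
    """output[x] = sum_z filt[z] * inputs[x*stride + dilation_rate*z]."""
    # Effective filter size: (filter_size - 1) * dilation_rate + 1
    effective = (len(filt) - 1) * dilation_rate + 1
    out_len = (len(inputs) - effective) // stride + 1
    return [
        sum(filt[z] * inputs[x * stride + dilation_rate * z]
            for z in range(len(filt)))
        for x in range(out_len)
    ]


signal = [1, 2, 3, 4, 5, 6]
box = [1, 1, 1]

plain = conv1d_valid(signal, box)                      # stride 1, no dilation
strided = conv1d_valid(signal, box, stride=2)          # output striding
dilated = conv1d_valid(signal, box, dilation_rate=2)   # atrous convolution
print(plain, strided, dilated)
```

The strided and dilated calls are kept separate because, as the excerpt notes, strides greater than 1 and dilation rates greater than 1 cannot be combined.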

In the case that data_format does start with “NC”, the input and output (but not the filter) are simply transposed as follows:

convolution(input, data_format, kwargs) = tf.transpose(convolution(tf.transpose(input, [0] + range(2,N+2) + [1]), kwargs), [0, N+1] + range(1, N+1))

It is required that 1 <= N <= 3.

Args:

input: An N-D Tensor of type T, of shape [batch_size] + input_spatial_shape + [in_channels] if data_format does not start with “NC” (default), or [batch_size, in_channels] + input_spatial_shape if data_format starts with “NC”.

filter: An N-D Tensor with the same type as input and shape spatial_filter_shape + [in_channels, out_channels].

padding: A string, either “VALID” or “SAME”. The padding algorithm.

strides: Optional. Sequence of N ints >= 1. Specifies the output stride. Defaults to [1]*N. If any value of strides is > 1, then all values of dilation_rate must be 1.

dilation_rate: Optional. Sequence of N ints >= 1. Specifies the filter upsampling/input downsampling rate. In the literature, the same parameter is sometimes called input stride or dilation. The effective filter size used for the convolution will be spatial_filter_shape + (spatial_filter_shape - 1) * (rate - 1), obtained by inserting (dilation_rate[i]-1) zeros between consecutive elements of the original filter in each spatial dimension i. If any value of dilation_rate is > 1, then all values of strides must be 1.

name: Optional name for the returned tensor.

data_format: A string or None. Specifies whether the channel dimension of the input and output is the last dimension (default, or if data_format does not start with “NC”), or the second dimension (if data_format starts with “NC”). For N=1, the valid values are “NWC” (default) and “NCW”. For N=2, the valid values are “NHWC” (default) and “NCHW”. For N=3, the valid values are “NDHWC” (default) and “NCDHW”.
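The effective filter size given for dilation_rate above can be checked with a small sketch. The helper names effective_filter_size and dilate_filter are hypothetical; the second one inserts (rate - 1) zeros between consecutive elements of a 1-D filter, which is how the documentation describes the dilated filter.

```python
def effective_filter_size(filter_size, rate):
    """filter_size + (filter_size - 1) * (rate - 1), as stated in the docs."""
    return filter_size + (filter_size - 1) * (rate - 1)


def dilate_filter(filter_1d, rate):
    """Insert (rate - 1) zeros between consecutive filter elements."""
    dilated = []
    for index, weight in enumerate(filter_1d):
        dilated.append(weight)
        if index < len(filter_1d) - 1:
            dilated.extend([0] * (rate - 1))
    return dilated


dilated_filter = dilate_filter([1, 2, 3], rate=2)
print(dilated_filter, effective_filter_size(3, 2))
```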

Returns:
A Tensor with the same type as input of shape

`[batch_size] + output_spatial_shape + [out_channels]`

if data_format is None or does not start with “NC”, or

`[batch_size, out_channels] + output_spatial_shape`

if data_format starts with “NC”, where output_spatial_shape depends on the value of padding.

If padding == “SAME”: output_spatial_shape[i] = ceil(input_spatial_shape[i] / strides[i])

If padding == “VALID”: output_spatial_shape[i] = ceil((input_spatial_shape[i] - (spatial_filter_shape[i]-1) * dilation_rate[i]) / strides[i]).

Raises:

ValueError: If input/output depth does not match filter shape, if padding is other than “VALID” or “SAME”, or if data_format is invalid.
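The two output_spatial_shape formulas in the excerpt are easy to try out per dimension. The following is a small sketch that simply evaluates them; the function name output_size is hypothetical and the sizes are chosen to match the 28x28 example used later in this post.

```python
import math


def output_size(input_size, filter_size, stride=1, dilation_rate=1,
                padding="SAME"):
    """Evaluate the documented output size formula for one spatial dimension."""
    if padding == "SAME":
        return math.ceil(input_size / stride)
    if padding == "VALID":
        return math.ceil(
            (input_size - (filter_size - 1) * dilation_rate) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


same_size = output_size(28, 5, padding="SAME")               # 5x5 kernel, stride 1
valid_size = output_size(28, 5, padding="VALID")             # no border padding
pooled_size = output_size(28, 2, stride=2, padding="VALID")  # 2x2 window, stride 2
print(same_size, valid_size, pooled_size)
```

With SAME padding and a stride of 1 the spatial size is preserved, which is why the convolutional layers in the example network below keep their 28x28 and 14x14 maps.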

Returning to tf.layers.conv2d, the activation parameter specifies the activation function to apply.

Pooling layer (max_pooling2d or average_pooling2d): builds a 2-D pooling layer. Commonly used parameters:

tf.layers.max_pooling2d(
        inputs=conv,
        pool_size=[2, 2],
        strides=2)

inputs is the input Tensor to be pooled, pool_size is the size of the pooling window, and strides is the stride of the pooling operation.

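Fully connected layer (dense): performs classification on the feature vector. Before the fully connected operation, the pooled feature maps must be flattened into the form [batch_size, features], as follows:
tf.reshape(pool, [-1, 7 * 7 * 64]), where -1 stands for the batch size, which is inferred from the input.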
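What the reshape does to the data can be sketched in pure Python. This assumes a nested list laid out as [batch_size, height, width, channels]; the helper name flatten_batch and the tiny 2x2x3 sample are hypothetical. Each element is tagged with its own indices so the resulting order is visible.

```python
def flatten_batch(batch):
    """Flatten [batch, height, width, channels] into [batch, features]."""
    return [
        [value for row in image for pixel in row for value in pixel]
        for image in batch
    ]


batch_size, height, width, channels = 2, 2, 2, 3
# Each element records its own (b, h, w, c) position
batch = [[[[(b, h, w, c) for c in range(channels)]
           for w in range(width)]
          for h in range(height)]
         for b in range(batch_size)]

flat = flatten_batch(batch)
print(len(flat), len(flat[0]))
```

The batch dimension is untouched and the channel index varies fastest, which matches the row-major order used by a reshape. For the network in this post the feature count is 7 * 7 * 64 = 3136.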

The main parameters of the fully connected layer are:

tf.layers.dense(
        inputs=pool2_flat,
        units=1024,
        activation=tf.nn.relu)

inputs is the input layer; units is the number of output neurons, so the output tensor has shape [batch_size, units]; activation specifies the activation function to apply.

Example code that builds a convolutional neural network with the TensorFlow API:

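# Imports assumed for this excerpt (TensorFlow 1.x)
import tensorflow as tf
from tensorflow.contrib import learn

# Input layer
# Reshape the input into a 4-D tensor: [batch_size, height, width, channels]
# The images are 28x28 pixels with a single channel
# (features is assumed to be the input tensor supplied by the enclosing model function)
input_layer = tf.reshape(features, [-1, 28, 28, 1])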

# Convolutional layer 1
# 5x5 kernels, 32 filters, ReLU activation
# Input tensor shape: [batch_size, 28, 28, 1]
# Output tensor shape: [batch_size, 28, 28, 32]
conv1 = tf.layers.conv2d(
    inputs=input_layer,
    filters=32,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu)

# Pooling layer 1
# 2x2 max pooling with a stride of 2
# Input tensor shape: [batch_size, 28, 28, 32]
# Output tensor shape: [batch_size, 14, 14, 32]
pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)

# Convolutional layer 2
# 5x5 kernels, 64 filters, ReLU activation
# Input tensor shape: [batch_size, 14, 14, 32]
# Output tensor shape: [batch_size, 14, 14, 64]
conv2 = tf.layers.conv2d(
    inputs=pool1,
    filters=64,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu)

# Pooling layer 2
# 2x2 max pooling with a stride of 2
# Input tensor shape: [batch_size, 14, 14, 64]
# Output tensor shape: [batch_size, 7, 7, 64]
pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)


# Flatten the pooling layer's output tensor into a vector
# Input tensor shape: [batch_size, 7, 7, 64]
# Output tensor shape: [batch_size, 7 * 7 * 64]
pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])

# Fully connected layer
# This layer has 1024 neurons
# Input tensor shape: [batch_size, 7 * 7 * 64]
# Output tensor shape: [batch_size, 1024]
dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)

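# Apply dropout to the fully connected layer's output to reduce overfitting
# 40% of the units are dropped, and only during training
# (mode is assumed to be passed in by the enclosing model function)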
dropout = tf.layers.dropout(
    inputs=dense, rate=0.4, training=mode == learn.ModeKeys.TRAIN)

# Logits layer: performs classification on the dropout layer's output tensor
# Input tensor shape: [batch_size, 1024]
# Output tensor shape: [batch_size, 10]
logits = tf.layers.dense(inputs=dropout, units=10)
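The tensor shapes noted in the comments above can be double-checked without TensorFlow. The sketch below traces the shapes layer by layer and also counts trainable parameters, assuming every layer has a bias term (the default for tf.layers.conv2d and tf.layers.dense). All helper names are hypothetical.

```python
def conv_same(shape, filters):
    """Stride-1 convolution with padding="same": spatial size is unchanged."""
    height, width, _ = shape
    return (height, width, filters)


def max_pool(shape, pool_size=2, strides=2):
    """Pooling without padding, as in the calls above."""
    height, width, channels = shape
    return ((height - pool_size) // strides + 1,
            (width - pool_size) // strides + 1,
            channels)


def conv_params(kernel, in_channels, filters):
    return kernel * kernel * in_channels * filters + filters  # weights + biases


def dense_params(in_units, out_units):
    return in_units * out_units + out_units  # weights + biases


input_shape = (28, 28, 1)
conv1_shape = conv_same(input_shape, 32)
pool1_shape = max_pool(conv1_shape)
conv2_shape = conv_same(pool1_shape, 64)
pool2_shape = max_pool(conv2_shape)
flat_units = pool2_shape[0] * pool2_shape[1] * pool2_shape[2]

total_params = (conv_params(5, 1, 32) + conv_params(5, 32, 64)
                + dense_params(flat_units, 1024) + dense_params(1024, 10))
print(conv1_shape, pool1_shape, conv2_shape, pool2_shape, flat_units)
print(total_params)
```

Under these assumptions almost all of the parameters sit in the first fully connected layer, which is typical for networks that flatten a feature map into a wide dense layer.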

The TensorFlow CIFAR-10 Model

CIFAR-10 (http://www.cs.toronto.edu/~kriz/cifar.html) is a benchmark problem in image recognition: classifying 32x32 RGB images into 10 classes, namely airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The dataset contains 50,000 training images and 10,000 test images.
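The TensorFlow CIFAR-10 model (https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10) has 1,068,298 parameters, and inference on a single image takes 19.5M multiply-add operations. After a few hours of training on a GPU, the model reaches a test accuracy of 86%.
Key features of the model include: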

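Core mathematical components: convolution, ReLU activation, pooling, and local response normalization.
Visualization: displays the loss, gradients, and parameter distributions during training.
Moving averages: evaluation uses the moving averages of the parameters.
Preprocessing queues: training data is preprocessed through queues, which reduces data-reading latency and speeds up preprocessing.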
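The moving-average feature in the list above can be illustrated with a minimal sketch of an exponential moving average, in which a shadow copy of each parameter is pulled toward the current value after every training step. The function name, the decay of 0.5, and the sample values are illustrative assumptions and are not taken from the CIFAR-10 code.

```python
def update_moving_average(shadow, value, decay):
    """One exponential-moving-average step for a single parameter."""
    return decay * shadow + (1 - decay) * value


# Hypothetical values of one parameter after three training steps
parameter_history = [1.0, 2.0, 3.0]

shadow = parameter_history[0]     # start the shadow at the initial value
for value in parameter_history[1:]:
    shadow = update_moving_average(shadow, value, decay=0.5)
print(shadow)
```

At evaluation time the shadow values are used in place of the raw parameters. In practice the decay is usually chosen much closer to 1, so the average changes slowly.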

The model's code is organized as follows:

cifar10_input.py: loads the training data.
cifar10.py: builds the CIFAR-10 model.
cifar10_train.py: trains on a single device (CPU or GPU).
cifar10_multi_gpu_train.py: trains on multiple GPUs.
cifar10_eval.py: evaluates the model.

[Figure: graph structure of the model]
In the multi-GPU architecture for CIFAR-10, the GPUs should ideally be the same model, and each needs enough memory to run the entire CIFAR-10 model.

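This architecture places a copy of the CIFAR-10 model on every GPU. After each GPU finishes training on one batch, the gradients are synchronized on the CPU by averaging them, the training parameters are updated, and the updated model parameters are sent back to every GPU for the next batch.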
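The synchronization step described above can be sketched in plain Python, with lists of numbers standing in for tensors. This is only an outline of the idea under simplifying assumptions (plain gradient descent, two GPUs, two scalar parameters); the function names and values are hypothetical and not taken from cifar10_multi_gpu_train.py.

```python
def average_gradients(tower_grads):
    """Average the gradients computed on each GPU, parameter by parameter.

    tower_grads holds one list per GPU, with gradients in the same order.
    """
    num_towers = len(tower_grads)
    return [sum(grads) / num_towers for grads in zip(*tower_grads)]


def apply_gradients(params, grads, learning_rate):
    """Plain gradient-descent update performed once on the CPU."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


params = [1.0, 1.0]               # shared model parameters
tower_grads = [[1.0, 4.0],        # gradients from GPU 0 for its batch
               [3.0, 0.0]]        # gradients from GPU 1 for its batch

mean_grads = average_gradients(tower_grads)
params = apply_gradients(params, mean_grads, learning_rate=0.5)
# The updated params would now be sent back to every GPU for the next batch
print(mean_grads, params)
```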

The TensorFlow VGG19 Model

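Like AlexNet, VGG is a CNN. In ILSVRC 2014 it took first place in the localization task and second place in the classification task. VGG networks are very deep, typically 16 to 19 layers, and use 3 x 3 convolution kernels; the 16-layer and 19-layer variants differ mainly in the number of convolutional layers in the last three convolutional blocks. The front of the network alternates convolutions with max pooling and is followed by three fully connected layers; the activation function is ReLU, and dropout is used during training. VGG19 reaches a top-1 accuracy of 71.1% and a top-5 accuracy of 89.8%.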

An example of the VGG19 model in TensorFlow is shown below:

# Excerpt from inside a model-building function; inputs, num_classes,
# is_training, dropout_keep_prob and fc_conv_padding are assumed to be
# defined by that function (TensorFlow 1.x with slim).
import tensorflow as tf
slim = tf.contrib.slim

# Convolution and pooling blocks
net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
net = slim.max_pool2d(net, [2, 2], scope='pool1')
net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')
net = slim.max_pool2d(net, [2, 2], scope='pool2')
net = slim.repeat(net, 4, slim.conv2d, 256, [3, 3], scope='conv3')
net = slim.max_pool2d(net, [2, 2], scope='pool3')
net = slim.repeat(net, 4, slim.conv2d, 512, [3, 3], scope='conv4')
net = slim.max_pool2d(net, [2, 2], scope='pool4')
net = slim.repeat(net, 4, slim.conv2d, 512, [3, 3], scope='conv5')
net = slim.max_pool2d(net, [2, 2], scope='pool5')
# fc6 to fc8: the fully connected layers, written here as convolutions
net = slim.conv2d(net, 4096, [7, 7], padding=fc_conv_padding, scope='fc6')
# Dropout to reduce overfitting
net = slim.dropout(net, dropout_keep_prob, is_training=is_training,
                   scope='dropout6')
net = slim.conv2d(net, 4096, [1, 1], scope='fc7')
net = slim.dropout(net, dropout_keep_prob, is_training=is_training,
                   scope='dropout7')
net = slim.conv2d(net, num_classes, [1, 1],
                  activation_fn=None,
                  normalizer_fn=None,
                  scope='fc8')
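A short sketch can confirm where the "19" in VGG19 comes from and why fc6 uses a 7x7 kernel. The block configuration for VGG19 is read off the code above; the 224x224 input size and the VGG16 configuration (three instead of four convolutions in the last three blocks) are assumptions stated here for illustration.

```python
# (number of conv layers, filters) for each block, read off the code above
vgg19_blocks = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]
# Assumed VGG16 configuration: one fewer conv layer in the last three blocks
vgg16_blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

NUM_FC_LAYERS = 3  # fc6, fc7, fc8


def weight_layers(blocks):
    """Count convolutional plus fully connected layers."""
    return sum(count for count, _ in blocks) + NUM_FC_LAYERS


def spatial_size_after_pools(input_size, blocks):
    """Each block ends with a 2x2 max pool of stride 2, halving the size."""
    size = input_size
    for _ in blocks:
        size //= 2
    return size


vgg19_layers = weight_layers(vgg19_blocks)
vgg16_layers = weight_layers(vgg16_blocks)
final_size = spatial_size_after_pools(224, vgg19_blocks)  # assumed 224x224 input
print(vgg19_layers, vgg16_layers, final_size)
```

With a 224x224 input, the feature map entering fc6 is 7x7, so a 7x7 convolution without padding covers it completely and behaves like a fully connected layer.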

Summary

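This post reviewed the basic concepts of deep convolutional neural networks: feature maps, convolution kernels, pooling, and fully connected layers. It then built a convolutional neural network with the TensorFlow API, covering the convolution, pooling, and fully connected APIs, and walked through the architecture and code of two image recognition models, CIFAR-10 and VGG19. Later posts will introduce more models, such as those for VOC. I also plan to label objects in my own image data, train on it, and produce my own model.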