Convolutional Neural Network (CNN) for handwritten digit recognition

Original link: https://blog.csdn.net/polyhedronx/article/details/94476824

1. Introduction

The previous blog post used a fully connected neural network with a single hidden layer of 100 neurons, combined with several optimization strategies such as an exponentially decaying learning rate, regularization, the ReLU activation function and the Adam optimization algorithm, and achieved about 98% accuracy on the MNIST handwritten digit data set. However, fully connected neural networks have limitations: even with a deep network, many hidden nodes and a large number of iterations, it is difficult to exceed 99% accuracy on MNIST. Convolutional neural networks solve this problem and can finally reach an accuracy above 99%, which meets the needs of some high-precision recognition systems.

2. The basic principles of convolutional neural networks
2.1 Convolution operation

Each pixel in an image is closely related to its surrounding pixels, but not necessarily to pixels that are far away. This is the concept of receptive fields in human vision: each receptive field only receives signals from a small area, and the pixels within this small area are related to each other. Accordingly, each neuron does not need to receive all the pixel information; it only needs to receive local pixels as input, and the local information received by these neurons is then integrated to obtain global information.

The image convolution operation starts from the upper left corner of the image and slides a convolution template over it. At each position, the gray value of each pixel covered by the template is multiplied by the corresponding value of the template, and the sum of all the products is taken as the convolution result for the pixel under the center of the template; sliding the template over all positions of the image in this way completes the convolution. In a convolutional neural network this template is usually called a convolution kernel, or filter. The following figure shows part of the image convolution process, using a 3×3 convolution kernel on a 5×5 image.

The image convolution operation can be expressed as

$S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(i + m,\, j + n)\, K(m, n)$

where $I$ is the matrix of the image to be convolved, $K$ is the convolution kernel, and $S$ is the output image of the convolution operation. In deep learning, both the input image matrix and the output result matrix are called feature maps.
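
To make the sliding-window process concrete, here is a minimal NumPy sketch of the convolution described above (my illustration, not from the original post; as in CNN libraries, the kernel is applied without flipping):

import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image; each output value is the sum of
    elementwise products over the covered region (no padding, stride 1)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # a 5x5 example image
kernel = np.ones((3, 3)) / 9.0                    # a 3x3 averaging kernel
print(conv2d_valid(image, kernel))                # 3x3 output feature map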

2.2 Pooling operation
After a two-dimensional feature map is obtained through the convolutional layer, its size is usually still very large. If these features were sent directly to a classifier, the amount of computation would be very large, and over-fitting problems could also arise, so it is not convenient to classify on these feature maps directly. The pooling operation is a technique designed to solve such problems: it aggregates features at different positions of the feature map matrix in order to condense them.

A schematic diagram of the pooling operation is shown in the figure above. Two pooling operations are commonly used. One is average pooling, whose output is the average of the input feature map values within the range covered by the pooling kernel; the other is max pooling, whose output is the maximum of the input feature map values within the range covered by the pooling kernel. In this sense, the pooling operation can be seen as a special kind of image convolution operation.

The pooling operation can significantly improve the performance of a convolutional neural network, mainly because it condenses the features and reduces the dimension of the feature maps, so the over-fitting that convolutional neural networks are prone to is correspondingly reduced. In addition, since the feature information within a certain range is condensed, the pooling operation also enhances translation invariance over small ranges in the convolutional neural network.
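
As an illustration (my sketch, not from the original post), the following NumPy function performs 2×2 max or average pooling with stride 2:

import numpy as np

def pool2x2(feature_map, mode='max'):
    """Reduce each non-overlapping 2x2 block to its max or average."""
    h, w = feature_map.shape
    blocks = feature_map[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == 'max' else blocks.mean(axis=(1, 3))

fm = np.array([[1., 2., 3., 4.],
               [5., 6., 7., 8.],
               [8., 7., 6., 5.],
               [4., 3., 2., 1.]])
print(pool2x2(fm, 'max'))   # [[6. 8.] [8. 6.]]
print(pool2x2(fm, 'mean'))  # [[3.5 5.5] [5.5 3.5]]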

2.3 Convolutional layer

A general convolutional neural network is composed of multiple convolutional layers, and the following operations are usually performed in each convolutional layer.

The image is filtered by multiple different convolution kernels, with biases added, to extract local features; each convolution kernel maps out a new 2D image.
The filtering output of the previous step is processed with a non-linear activation function; at present the ReLU function is generally used.
The result of the activation function is then pooled (i.e. down-sampled; for example, a 2×2 region is reduced to a 1×1 value). Max pooling is currently the common choice, since it retains the most salient features and improves the model's tolerance to distortion.
Note that a convolutional layer generally has multiple different convolution kernels, because each kernel can only extract one kind of image feature, and the number of kernels can be increased to extract more features. Each convolution kernel corresponds to one new image mapped out after filtering, and every pixel in the same new image comes from the same convolution kernel; this is the weight sharing of the convolution kernel. The purpose of sharing the kernel's weight parameters is simple: to reduce model complexity, mitigate overfitting and reduce the amount of computation.

The number of weights that a convolutional layer needs to train is related only to the size and the number of its convolution kernels, so we can process pictures of any size with very few parameters. The features extracted by each convolutional layer are abstractly combined into higher-order features in subsequent layers.
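
For example, taking the first convolutional layer of the network in Section 3 below: 32 kernels of size 5×5 on a single-channel input need 5 × 5 × 1 × 32 = 800 weights plus 32 biases, i.e. 832 trainable parameters in total, whether the input image is 28×28 or far larger.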

2.4 Convolutional Neural Network
The difference between a convolutional neural network and a multilayer perceptron is that the convolutional neural network contains several feature extractors composed of convolutional layers and pooling layers, which effectively reduce the number of parameters and greatly simplify the model, thereby reducing the risk of overfitting. At the same time, this structure gives the convolutional neural network tolerance to translation and slight deformation, improving the generalization ability of the model.

The structure of the famous LeNet-5 is shown in the figure below; it contains three convolutional layers, a fully connected layer and a Gaussian connection layer. Generally speaking, different convolutional neural networks can be designed for different data sets and input image sizes to deal with different practical problems.

3. The design of a convolutional neural network for handwritten digit recognition

The handwritten digit recognition problem is relatively simple, so two convolutional layers and a fully connected layer are used to construct a simple convolutional neural network.

3.1 Interpretation of main TensorFlow functions
(1) tf.nn.conv2d

Given a 4-dimensional input and a filter Tensor, computes a 2-dimensional convolution.

tf.nn.conv2d(
    input,
    filter=None,
    strides=None,
    padding=None,
    use_cudnn_on_gpu=True,
    data_format='NHWC',
    dilations=[1, 1, 1, 1],
    name=None,
    filters=None
)

input:
The input to be convolved, a 4-dimensional Tensor whose type is one of half, bfloat16, float32 and float64. The dimension order follows data_format and defaults to [batch, in_height, in_width, in_channels], i.e. [number of images in a training batch, image height, image width, number of image channels].

filter:
Equivalent to the convolution kernel in a CNN. A Tensor of the same type as input, with a shape like [filter_height, filter_width, in_channels, out_channels], i.e. [kernel height, kernel width, number of image channels, number of convolution kernels]. Note that the third dimension, in_channels, must equal the fourth dimension of input.

strides:
The stride of the sliding window in each dimension of the input during convolution. A one-dimensional vector whose order follows data_format (default NHWC); the type is int or a list of ints of length 1, 2 or 4. The strides for N and C are set to 1, so the general format is [1, stride[1], stride[2], 1]; in most cases the strides on height and width are the same, giving [1, stride, stride, 1].

padding:
A string, which can only be "SAME" or "VALID", indicating the padding scheme. Because the output of a convolution is generally smaller than the input, padding="SAME" can be used to obtain an output with the same spatial size as the input.

[Figures: convolution with strides=[1, 1, 1, 1], padding="VALID" (left) and strides=[1, 1, 1, 1], padding="SAME" (right)]
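
For reference, TensorFlow computes the output spatial size as out = ceil(in / stride) for padding="SAME" and out = ceil((in - filter + 1) / stride) for padding="VALID". For example, a 5×5 input convolved with a 3×3 kernel at stride 1 yields a 5×5 output with "SAME" but a 3×3 output with "VALID".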

use_cudnn_on_gpu:
Bool type, whether to use cuDNN acceleration; the default is True.

data_format:
Specifies the format of the input and output data. An optional string, one of "NHWC" or "NCHW"; the default is "NHWC". With the default "NHWC", the data is stored in the order [batch, height, width, channels].

dilations:
The dilation factor for each input dimension. A one-dimensional vector whose order follows data_format (default NHWC); the type is int or a list of ints of length 1, 2 or 4, and all values default to 1. If a single value is given, it is copied to H and W.
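
For example, with dilations=[1, 2, 2, 1] one implicit zero is inserted between adjacent kernel elements in the height and width dimensions, so a 3×3 kernel covers a 5×5 receptive field.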

Given an input tensor [batch, in_height, in_width, in_channels] and a filter/kernel tensor [filter_height, filter_width, in_channels, out_channels], perform the following operations:

Flatten the filter into a two-dimensional matrix with shape [filter_height * filter_width * in_channels, output_channels].
Extract image patches from the input according to the filter size to form a virtual tensor of shape [batch, out_height, out_width, filter_height * filter_width * in_channels].
For each patch, right-multiply by the filter matrix.
The calculation formula is:

output[b, i, j, k] = sum over di, dj, q of
    input[b, strides[1] * i + di, strides[2] * j + dj, q] * filter[di, dj, q, k]

Some examples:

Input [1, 3, 3, 1] with filter [2, 2, 1, 1] and padding='SAME'; the padding scheme is as shown in the figure:

Input [1, 2, 2, 1] with filter [3, 3, 1, 1] and padding='SAME'; the padding scheme is as shown in the figure:

For the multi-channel case, the input [1, 3, 3, 2] is a 3×3 image with 2 channels, the filter is [2, 2, 2, 1], the stride is 1 and padding='VALID'; the output is [1, 2, 2, 1], as shown in the figure:
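
As a quick check of this multi-channel case, here is a minimal test program (my addition, in the same style as the pooling example below; with all-ones input and filter, every output value sums 2×2×2 = 8 ones):

import tensorflow as tf

# one 3x3 image with 2 channels, all ones
x = tf.ones([1, 3, 3, 2])
# one 2x2 convolution kernel spanning the 2 input channels, all ones
w = tf.ones([2, 2, 2, 1])

y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='VALID')

with tf.Session() as sess:
    print(sess.run(y))  # shape (1, 2, 2, 1), every value 8.0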

(2) tf.nn.max_pool

Performs max pooling on the input; it can be regarded as a special convolution operation.

tf.nn.max_pool(
    value,
    ksize,
    strides,
    padding,
    data_format='NHWC',
    name=None,
    input=None
)

value:
The input to be pooled. The pooling layer generally follows a convolutional layer, so the input is usually a feature map, still with a shape like [batch, height, width, channels].


ksize:
The size of the pooling window, a four-dimensional vector, generally [1, height, width, 1]; because we do not want pooling to span the batch and channels dimensions, these two are set to 1.

An example:

Assume the input is a two-channel map, value = [1, 4, 4, 2], and the pooling window is ksize = [1, 2, 2, 1] with strides = [1, 2, 2, 1]; the result is shown below.

Test program:

import tensorflow as tf
 
a = tf.constant([
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [8.0, 7.0, 6.0, 5.0],
     [4.0, 3.0, 2.0, 1.0]],
    [[4.0, 3.0, 2.0, 1.0],
     [8.0, 7.0, 6.0, 5.0],
     [1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
])
 
# reshape the 32 values into one 4x4 image with 2 channels (NHWC)
a = tf.reshape(a, [1, 4, 4, 2])
 
# 2x2 max pooling window, stride 2 in height and width
pooling = tf.nn.max_pool(value=a, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
with tf.Session() as sess:
    print("image:")
    image = sess.run(a)
    print(image)
    print("reslut:")
    result = sess.run(pooling)
    print(result)
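
For reference, the result computed by this program (shown as a figure in the original post, and verifiable by hand from the values above) has shape (1, 2, 2, 2); reading each position as (channel 0, channel 1), the pooled map is (8, 7), (7, 8) in the first row and (4, 4), (8, 8) in the second.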

(3) tf.nn.dropout

Randomly sets some node outputs to 0. During training, part of the node data is randomly discarded to reduce over-fitting, while during prediction all data is kept to obtain the best prediction performance. It is generally used in the fully connected layers.

In TensorFlow, its parameters are defined as follows:

tf.nn.dropout(
    x,
    keep_prob=None,
    noise_shape=None,
    seed=None,
    name=None,
    rate=None
)

Among these parameters, x is the input, a floating-point tensor. keep_prob is the probability that a neuron is kept, and rate is the probability that an element of x is discarded; obviously keep_prob = 1 - rate (the official documentation deprecates keep_prob and recommends using rate instead). keep_prob is usually defined as a placeholder at graph-construction time, keep_prob = tf.placeholder(tf.float32), and TensorFlow sets its concrete value at run time, e.g. keep_prob: 0.5 in the feed_dict. noise_shape is a one-dimensional int32 tensor representing the shape of the randomly generated keep/drop flags. seed is the random number seed, an integer.

For each element of the input x, the output is 0 with probability rate; otherwise the element is scaled up by a factor of 1/(1 - rate) so that the expected value of the output sum remains unchanged.
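
For example, with rate = 0.5 a kept element of value 3.0 is output as 6.0, so the expectation 0.5 × 0 + 0.5 × 6.0 = 3.0 matches the input.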

By default, each element is kept or dropped independently. If noise_shape is specified, only the dimensions with noise_shape[i] == shape(x)[i] make independent decisions (each element of noise_shape must be either 1 or the corresponding element of x.shape). For example, if shape(x) = [k, l, m, n] and noise_shape = [k, 1, 1, n], each batch and channel is independent, while each row and column is either kept as a whole or zeroed as a whole.

Some examples are given below:

Assume a batch of two input pictures, each a two-channel picture of size 3×3, i.e. x = [2, 3, 3, 2], and keep_prob = 0.5; the results are shown below.

Test program:

import tensorflow as tf
 
b = tf.constant([
    [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]],
    [[9.0, 8.0, 7.0],
     [6.0, 5.0, 4.0],
     [3.0, 2.0, 1.0]],
    [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]],
    [[9.0, 8.0, 7.0],
     [6.0, 5.0, 4.0],
     [3.0, 2.0, 1.0]]
])
 
# reshape the 36 values into two 3x3 images with 2 channels (NHWC)
b = tf.reshape(b, [2, 3, 3, 2])
 
# drop each element with probability 1 - keep_prob = 0.5
drop = tf.nn.dropout(x=b, keep_prob=0.5, noise_shape=[2, 3, 3, 2])
with tf.Session() as sess:
    print("image:")
    image = sess.run(b)
    print(image)
    print("result:")
    result = sess.run(drop)
    print(result)

With noise_shape=[2,3,3,2] (or noise_shape=None), elements are zeroed independently, and the remaining values are scaled up by 1/(1-0.5) = 2:

With noise_shape=[1,3,3,2], the zeroing pattern is the same across the different pictures in the batch:

With noise_shape=[2,1,3,2], the zeroing pattern is the same across different rows of the output:

With noise_shape=[2,3,1,2], the zeroing pattern is the same across different columns of the output:

With noise_shape=[2,3,3,1], the zeroing pattern is the same across different channels of the output:

3.2 Program and results

The program was run with Python 3.7.3 and TensorFlow 1.13.1.

The weights are initialized as random numbers drawn from a truncated normal distribution with standard deviation 0.1. Because the ReLU function is used, the biases are initialized to the constant 0.1 to avoid dead nodes.

The input image is a 28×28 grayscale image. In the first convolutional layer the kernel size is set to 5×5 with 1 color channel, and the number of convolution kernels (output channels) is set to 32 (that is, how many kinds of features this layer will extract); the strides in width and height are both 1, and padding is applied (padding="SAME", so the output image size is the same as the input); the activation function is ReLU; the pooling window size is 2×2, with strides of 2 in both width and height.

The kernel size of the second convolutional layer is also 5×5, the number of input channels is 32 (the output channels of the previous layer), the number of convolution kernels (output channels) is set to 64, and the other parameters are the same as those of the first convolutional layer.

After the two 2×2 max poolings with stride 2, the side length has been halved twice and is 1/4 of the original, so the picture size goes from 28×28 to 7×7. Since the second convolutional layer has 64 convolution kernels, the output tensor has size 7×7×64. It is flattened into a one-dimensional vector and connected to a fully connected layer with 1024 hidden nodes and the ReLU activation function.
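
For reference, the tensor shapes through the network (derived from the parameters above) are:

input:             [batch, 28, 28, 1]
after conv1+pool1: [batch, 14, 14, 32]
after conv2+pool2: [batch, 7, 7, 64]
flattened:         [batch, 3136]
after fc1:         [batch, 1024]
output:            [batch, 10]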

To reduce over-fitting, a Dropout layer follows, with keep_prob set to 0.5 during training and 1 during testing. The output of the Dropout layer is connected to a softmax layer to produce the final probability output.

The loss function is the cross entropy, the optimizer is Adam, and the learning rate is set to 1e-4. The batch size is 50 and the number of training steps is 20000. The program is as follows.

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import matplotlib.pyplot as plt
 
# suppress TensorFlow warning messages
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
old_v = tf.logging.get_verbosity()
tf.logging.set_verbosity(tf.logging.ERROR)
 
 
# weight initialization
def weight_variable(shape):
    return tf.Variable(tf.truncated_normal(shape, stddev=0.1))
 
 
# bias initialization
def bias_variable(shape):
    return tf.Variable(tf.constant(0.1, shape=shape))
 
 
# convolution operation
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding="SAME")
 
 
# pooling operation
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="SAME")
 
 
# Convolutional Neural Network
def cnn2(x):
    x_image = tf.reshape(x, [-1, 28, 28, 1])
 
    # Layer 1: convolutional layer
    W_conv1 = weight_variable([5, 5, 1, 32])
    b_conv1 = bias_variable([32])
    h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
    h_pool1 = max_pool_2x2(h_conv1)
 
    # Layer 2: convolutional layer
    W_conv2 = weight_variable([5, 5, 32, 64])
    b_conv2 = bias_variable([64])
    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
    h_pool2 = max_pool_2x2(h_conv2)
 
    # Layer 3: full connection layer
    W_fc1 = weight_variable([7 * 7 * 64, 1024])
    b_fc1 = bias_variable([1024])
    h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
    h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
 
    # dropout layer
    keep_prob = tf.placeholder("float")
    h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
 
    # output layer
    W_fc2 = weight_variable([1024, 10])
    b_fc2 = bias_variable([10])
    y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
 
    return y_conv, keep_prob
 
 
# read data
mnist = input_data.read_data_sets('MNIST_data/', one_hot=True)
 
# input layer
x = tf.placeholder("float", shape=[None, 784])
y_ = tf.placeholder("float", shape=[None, 10])
 
# cnn
y_conv, keep_prob = cnn2(x)
 
# loss function & optimization algorithm
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_*tf.log(y_conv), reduction_indices=[1]))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
 
# new session
sess = tf.Session()
sess.run(tf.global_variables_initializer())
 
# train
losss = []
accurs = []
steps = []
correct_predict = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_predict, "float"))
for i in range(20000):
    batch = mnist.train.next_batch(50)
    sess.run(train_step, feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})
 
    if i % 100 == 0:
        loss = sess.run(cross_entropy, feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
        accur = sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0})
        steps.append(i)
        losss.append(loss)
        accurs.append(accur)
        print('Steps: {} loss: {}'.format(i, loss))
        print('Steps: {} accuracy: {}'.format(i, accur))
 
# plot loss
plt.figure()
plt.plot(steps, losss)
plt.xlabel('Number of steps')
plt.ylabel('Loss')
 
plt.figure()
plt.plot(steps, accurs)
plt.hlines(1, 0, max(steps), colors='r', linestyles='dashed')
plt.xlabel('Number of steps')
plt.ylabel('Accuracy')
plt.show()
 
tf.logging.set_verbosity(old_v)
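
A note on the program above (my remark, not part of the original post): computing tf.nn.softmax explicitly and then taking tf.log in the loss can produce NaN when a predicted probability underflows to zero. TensorFlow's fused op is more numerically stable; a minimal sketch of the substitution:

# in cnn2(), return the raw scores (logits) instead of the softmax output
logits = tf.matmul(h_fc1_drop, W_fc2) + b_fc2

# define the loss directly on the logits
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_, logits=logits))

# tf.nn.softmax(logits) can still be applied wherever probabilities are needed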

The curves of the loss and the test-set accuracy versus the number of training steps are shown in the figures below; the final accuracy is about 99.2%.
[Figure: loss vs. number of training steps]

[Figure: test-set accuracy vs. number of training steps]
