Detailed explanation of convolutional neural network (CNN) and TensorFlow2 code implementation

The name "convolutional neural network" sounds intimidating, but the idea behind it is simple. This article explains it in plain terms that anyone can follow.


What is convolution

A convolutional neural network is a traditional neural network that adds matrix convolution.
[Figure: matrix example of two-dimensional linear convolution]

( Part of the content is excerpted from this article )
Suppose we have a picture (left in the figure) and a convolution kernel (middle of the figure). Convolving the picture with the kernel gives the result on the right of the figure.
A picture is really a large matrix of numbers, which is what we usually call pixels. A grayscale picture is a two-dimensional matrix in which each element records how dark or bright that pixel is, so it can be treated as an ordinary mathematical matrix.

Two-dimensional convolution works as follows. Place the center of the convolution kernel (the kernel is itself a small matrix) over the first value of the two-dimensional matrix, the position marked with red box 1 in the figure. Multiply each kernel element by the image value at the corresponding position and sum the products; the sum is the new value for that position (-8 in the figure's example). Repeat this for every position to get all the new values, and fill the outermost ring of the final image with 0. The size of the result depends on the size of the convolution kernel, whose side length is generally odd (for example, 3×3 or 5×5) so that it has a center. Each new pixel value therefore depends on the pixel itself and the ring of pixels around it, and the weights are given by the convolution kernel, which determines what specific feature of the area is extracted.
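The multiply-and-sum step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's implementation: the function name `conv2d`, the 4x5 image, and the vertical-edge kernel are assumptions chosen for the example, and no zero padding is applied. Like deep-learning libraries, it does not flip the kernel before sliding it.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding) and return the multiply-and-sum results."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            # Multiply each kernel element by the image value beneath it, then sum
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# Hypothetical image: bright on the left (10), dark on the right (0)
image = [
    [10, 10, 10, 0, 0],
    [10, 10, 10, 0, 0],
    [10, 10, 10, 0, 0],
    [10, 10, 10, 0, 0],
]
# Hypothetical kernel that responds to vertical edges
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]
result = conv2d(image, kernel)
print(result)  # [[0, 30, 30], [0, 30, 30]]
```

The output is 0 where the window covers a flat area and 30 where it straddles the edge, which is what "the kernel decides which feature is extracted" means in practice.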


1. Introduction to Convolutional Neural Networks

Convolutional neural networks (CNNs) give machines a form of vision: they are used for tasks such as image classification, image recognition, object detection, and instance segmentation. Computer vision is the most common application area of CNNs, and handwriting recognition is a typical example.

Convolutional layer-extract local image features

A color picture has three color channels (red R, green G, and blue B), so the input has 3 layers. The convolution kernel therefore also has 3 layers: convolving is like laying one 3*3*3 Rubik's cube over another, multiplying the 27 pairs of corresponding elements, and adding up the 27 products. In other words, the convolution kernel is a stack of 3 two-dimensional (length and width) filters, and the number of filter layers is the same as the number of channels (layers) of the image.
As in the 2D convolution operation, we slide the filter across the image. At every position we get one weighted sum over the neighborhood in all three channels (3 layers), that is, over the RGB values in that neighborhood. Since we only slide the kernel in two dimensions, left to right and top to bottom, the output of this operation is 2D.
Suppose we have a 2D input of size 7x7 and apply a 3x3 filter starting from the upper left corner of the image. As the kernel slides from left to right and top to bottom, it fits in only 5 positions along each axis, so the output is 5x5, smaller than the input.
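The 7x7 to 5x5 arithmetic follows from counting how many positions the kernel fits in along one axis. A small sketch, assuming the standard size formula; the function name `conv_output_size` and its `padding` and `stride` parameters are introduced here for illustration only:

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Number of positions a kernel fits in along one axis."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(7, 3))              # 5: a 3x3 filter on a 7x7 input
print(conv_output_size(7, 3, padding=1))   # 7: one ring of zeros restores the size
print(conv_output_size(28, 5, padding=2))  # 28: a 5x5 filter needs two rings
```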
What if we want the output to be the same size as the input?
If the original input is 7x7 and we want the output to be 7x7 as well, we can add artificial padding (with value zero) evenly around the input. The filter K (3x3) can then be centered on every pixel of the image, including the pixels on the border, and compute the weighted sum of its neighbors.
Each convolution kernel extracts one feature. To capture the features of a picture fully, we therefore need multiple convolution kernels; their number is called the depth. Each kernel produces one 2D output, and these outputs are stacked together into a multi-layer output, as shown in the figure.
Once you understand this picture, you can understand the architecture diagram of the convolutional neural network at the end. Stacking more convolutional layers increases the number of these output layers.
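To make the stacking concrete, here is a minimal plain-Python sketch of a multi-channel convolution with several kernels. The function name `conv2d_multichannel`, the all-ones 7x7 RGB image, and the two kernels are assumptions made for the example; there is no padding and the stride is 1.

```python
def conv2d_multichannel(image, kernels):
    """image: H x W x C nested lists; kernels: list of K kernels, each k x k x C.
    Returns an out_h x out_w x K nested list (no padding, stride 1)."""
    channels = len(image[0][0])
    k = len(kernels[0])
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            values = []
            for kernel in kernels:
                total = 0
                # Sum over the k x k neighborhood in every channel
                for di in range(k):
                    for dj in range(k):
                        for c in range(channels):
                            total += image[i + di][j + dj][c] * kernel[di][dj][c]
                values.append(total)  # one value per kernel
            row.append(values)
        output.append(row)
    return output

# Hypothetical 7x7 RGB image with every value 1, and two 3x3x3 kernels
image = [[[1, 1, 1] for _ in range(7)] for _ in range(7)]
ones_kernel = [[[1, 1, 1] for _ in range(3)] for _ in range(3)]
red_only_kernel = [[[1, 0, 0] for _ in range(3)] for _ in range(3)]
feature_maps = conv2d_multichannel(image, [ones_kernel, red_only_kernel])

print(len(feature_maps), len(feature_maps[0]), len(feature_maps[0][0]))  # 5 5 2
print(feature_maps[0][0])  # [27, 9]
```

Two kernels give an output that is 2 layers deep. The all-ones kernel sums 27 products (3x3 positions in 3 channels), while the second kernel only looks at the red channel and sums 9.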

Padding-keep the length and width of the picture unchanged after convolution

By adding a ring of zeros around the input, we can keep the shape of the output the same as the input. With a larger filter K (5x5), more rings of zero padding are needed to keep the output size unchanged. Because the output size is the same as the input size, this strategy is called "same" padding. See this link for the original text.
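A minimal sketch of zero padding in plain Python, assuming an odd kernel size and stride 1. The helper names `same_padding` and `zero_pad` and the 2x2 example image are hypothetical, introduced only for illustration:

```python
def same_padding(kernel_size):
    """Rings of zeros needed so an odd-sized kernel keeps the output size (stride 1)."""
    return (kernel_size - 1) // 2

def zero_pad(image, pad):
    """Return a copy of a 2D nested list with `pad` rings of zeros around it."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]           # rows of zeros on top
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)  # zeros left and right
    padded.extend([0] * width for _ in range(pad))        # rows of zeros at the bottom
    return padded

image = [[1, 2],
         [3, 4]]
print(same_padding(3), same_padding(5))  # 1 2
for row in zero_pad(image, 1):
    print(row)
# [0, 0, 0, 0]
# [0, 1, 2, 0]
# [0, 3, 4, 0]
# [0, 0, 0, 0]
```

A 3x3 kernel needs one ring of zeros and a 5x5 kernel needs two, which matches the statement that larger filters need more padding.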

Pooling layer-reduce dimensions, reduce model complexity and computational complexity

Once we have the feature maps, we usually apply an operation called pooling. Learning the complex relationships in an image takes a large number of hidden layers, so the amount of computation grows quickly. Pooling shrinks the representation of the input features and thereby reduces the computing power the network needs.
Given an input feature map, we slide a window of a certain shape over it and keep only the maximum value from the part of the feature map under the window. This is called max pooling. It is also called subsampling, because we sample a single maximum value from each part of the feature map covered by the window.
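Max pooling can be sketched in plain Python as follows. The function name `max_pool2d` and the 4x4 feature map are assumptions for the example; a 2x2 window with stride 2 is used, the same setting as the pooling layers in the TensorFlow2 code later in this article.

```python
def max_pool2d(feature_map, pool_size=2, stride=2):
    """Keep the maximum of each pool_size x pool_size window of a 2D nested list."""
    output = []
    for i in range(0, len(feature_map) - pool_size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - pool_size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(pool_size)
                      for dj in range(pool_size)]
            row.append(max(window))
        output.append(row)
    return output

# Hypothetical 4x4 feature map
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 4], [8, 9]]
```

The 4x4 map shrinks to 2x2, so the next layer has a quarter as many values to process.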

Flatten-turn multi-dimensional data into one long one-dimensional vector

The convolutional layers give us multiple convolution results (the pink blocks in the last figure of the convolutional layer section), which are multi-dimensional.
But the prediction we want is one-dimensional: in binary classification, for example, it is either 0 or 1. How can multi-dimensional data produce a one-dimensional output?
Simple: flatten all the multi-dimensional data into a one-dimensional array, as if you took many Rubik's Cubes apart and laid the pieces out in a single row. Each Rubik's Cube is multi-dimensional, and lining up the pieces turns several multi-dimensional arrays into a single one-dimensional array.
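Flattening can be shown with a short recursive function in plain Python. This is a sketch for illustration; the name `flatten` and the two small feature maps are assumptions, and in a real model this step is done by a reshape on tensors.

```python
def flatten(data):
    """Recursively flatten nested lists into one flat list, keeping element order."""
    flat = []
    for item in data:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat

# Two hypothetical 2x2 feature maps stacked together (shape 2 x 2 x 2)
feature_maps = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
]
vector = flatten(feature_maps)
print(vector)       # [1, 2, 3, 4, 5, 6, 7, 8]
print(len(vector))  # 8
```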

Fully connected layer-output result

Once we have performed a series of convolution and pooling operations (max pooling or average pooling, also known as downsampling) on the feature representation of the image, we flatten the output of the final pooling layer into a vector and pass it through fully connected layers (a feedforward neural network) with some number of hidden layers. The result is a multi-layer deep neural network that is fitted to the data.
Finally, the output of the fully connected layers passes through a Softmax layer of the required size. The Softmax layer outputs a probability distribution as a vector, which is what an image classification task needs. In the digit recognition problem (shown in the figure), the Softmax layer has 10 neurons and assigns the input to one of 10 categories (the digits 0-9). For a two-class problem the Softmax layer has 2 neurons, which output the probabilities of class 0 and class 1. In general, the size of the final Softmax layer is determined by how many classes the result needs to be divided into.
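How softmax turns raw scores into a probability distribution can be sketched with the standard library alone. The function name `softmax` and the three example scores are hypothetical; subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical raw outputs of the last fully connected layer
probs = softmax(logits)
predicted_class = probs.index(max(probs))  # class with the highest probability

print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(predicted_class)               # 0
```

With 10 scores instead of 3, the same function would produce the 10 digit probabilities described above.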
For binary classification, the final Softmax layer has only two neurons, one for each of the 2 output classes.

2. TensorFlow2 code implementation

1. Import data

We use the MNIST dataset that ships with TensorFlow2: the model is trained on images of the handwritten digits 0-9 and then has to determine which digit each test image shows.
To import data, create a new MNISTLoader class.
The code is as follows (named testData.py):

import numpy as np
import tensorflow as tf

class MNISTLoader():
    def __init__(self):
        mnist = tf.keras.datasets.mnist
        (self.train_data,self.train_label),(self.test_data,self.test_label) = mnist.load_data()
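        # MNIST images are uint8 by default (values 0-255). Normalize them to floats in [0, 1]
        # and append a channel dimension. MNIST is grayscale, so there is 1 channel (an RGB image would have 3).
        self.train_data = np.expand_dims(self.train_data.astype(np.float32) / 255.0, axis=-1)  # [60000, 28, 28, 1]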
        self.test_data = np.expand_dims(self.test_data.astype(np.float32) / 255.0, axis=-1)        # [10000, 28, 28, 1]
        self.train_label = self.train_label.astype(np.int32)    # [60000]
        self.test_label = self.test_label.astype(np.int32)      # [10000]
        self.num_train_data, self.num_test_data = self.train_data.shape[0], self.test_data.shape[0]   #60000,10000

    def get_batch(self, batch_size):
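        # Randomly draw batch_size elements from the training set and return them
        index = np.random.randint(0, self.num_train_data, batch_size)  # sampled with replacement, so an item may be drawn more than once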
        return self.train_data[index, :], self.train_label[index]

# mnist = MNISTLoader()
# batch_size = 1
# train_data,train_label = mnist.get_batch(batch_size)
# print(train_data*255)
# print(train_label)
# print(train_data[0,:,1])

2. Build a CNN network with TensorFlow2

The code structure is as follows:
1. Define the hyperparameters
2. Set the model structure
3. Train the model
4. Predict the test set and test the accuracy

import numpy as np
import tensorflow as tf
from testData import MNISTLoader
import time

class CNN(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv2D(
            filters=32,             # number of convolution kernels: extracts 32 feature maps
            kernel_size=[5, 5],     # receptive field: height and width of the kernel
            padding='same',         # padding strategy ('valid' or 'same')
            activation=tf.nn.relu   # activation function
        )
        self.pool1 = tf.keras.layers.MaxPool2D(pool_size=[2, 2], strides=2)  # pooling layers usually use a 2x2 window
        self.conv2 = tf.keras.layers.Conv2D(
            filters=64,
            kernel_size=[5, 5],
            padding='same',
            activation=tf.nn.relu
        )
        self.pool2 = tf.keras.layers.MaxPool2D(pool_size=[2, 2], strides=2)  # pooling layers usually use a 2x2 window
        self.flatten = tf.keras.layers.Reshape(target_shape=(7*7*64,))  # flatten the 7x7x64 feature maps into a 1D vector
        self.dense1 = tf.keras.layers.Dense(units=1024, activation=tf.nn.relu)  # first fully connected layer, 1024 neurons
        self.dense2 = tf.keras.layers.Dense(units=10)  # last fully connected layer, one neuron per class; softmax is applied in call()

    def call(self,inputs):
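        x = self.conv1(inputs)  # first convolutional layer
        x = self.pool1(x)       # first pooling layer (downsampling)
        x = self.conv2(x)       # second convolutional layer
        x = self.pool2(x)       # second pooling layer (downsampling)
        x = self.flatten(x)     # flatten the intermediate result into one long 1D vector
        x = self.dense1(x)      # first fully connected layer
        x = self.dense2(x)      # second and last fully connected layer; its output feeds the softmax
        output = tf.nn.softmax(x)
        return output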

# Main program: load the data and train the model
# Define the hyperparameters
num_epochs = 5  # number of passes over the training set (approximate, because batches are drawn at random)
batch_size = 50
learning_rate = 0.001

print('now begin the train, time is ')
print(time.strftime('%Y-%m-%d %H:%M:%S',time.localtime()))
model = CNN()
data_loader = MNISTLoader()
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

num_batches = int(data_loader.num_train_data//batch_size*num_epochs)
for batch_index in range(num_batches):
    X,y = data_loader.get_batch(batch_size)
    # Record the forward pass so gradients can be computed afterwards
    with tf.GradientTape() as tape:
        y_pred = model(X)
        loss = tf.keras.losses.sparse_categorical_crossentropy(y_true=y,y_pred=y_pred)
        loss = tf.reduce_sum(loss)
    print("batch %d: loss %f"%(batch_index,loss.numpy()))
    grads = tape.gradient(loss,model.trainable_variables)
    optimizer.apply_gradients(grads_and_vars=zip(grads,model.trainable_variables))

print('now end the train, time is ')
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime()))
# Evaluate the model
sparse_categorical_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()
num_batches_test = int(data_loader.num_test_data//batch_size)  # split the test data into batches of 50 images each
for batch_index in range(num_batches_test):
    start_index,end_index = batch_index*batch_size,(batch_index+1)*batch_size
    y_pred = model.predict(data_loader.test_data[start_index:end_index])
    sparse_categorical_accuracy.update_state(
        y_true = data_loader.test_label[start_index:end_index],
        y_pred=y_pred
    )
print('test accuracy: %f'%sparse_categorical_accuracy.result())
print('now end the test, time is ')
print(time.strftime('%Y-%m-%d %H:%M:%S',time.localtime()))

In this run the prediction accuracy on the test set reaches 99.15%.
Output result:

batch 5999: loss 0.094517
now end the train, time is 
2021-03-18 17:15:46
test accuracy: 0.991500
now end the test, time is 
2021-03-18 17:16:05

Summary

To build a convolutional neural network you only need to do the following: decide the number of layers, and make the input and output of consecutive layers match as the data passes through convolution, activation, pooling, and so on. To produce the result, flatten the two-dimensional or higher-dimensional feature maps into one long one-dimensional vector, and then build the multi-layer network at the output end with fully connected layers. The final output layer uses the softmax function for classification, and the class with the highest output probability is the prediction.
That completes the construction of a convolutional neural network, and the accuracy is already quite good.
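Matching the input and output of each layer is mostly bookkeeping, and it explains the `7*7*64` in the Reshape layer of the model above. The following plain-Python sketch traces the shapes through that model; the function name `trace_shapes` is hypothetical, and it assumes 'same' padding for the convolutions and 2x2 pooling with stride 2, as in the code above.

```python
def trace_shapes(height=28, width=28, channels=1):
    """Follow the tensor shape (without the batch dimension) through the model's layers."""
    shapes = [("input", (height, width, channels))]
    for name, filters in (("conv1", 32), ("conv2", 64)):
        # Conv2D with padding='same' keeps height and width; filters sets the depth
        channels = filters
        shapes.append((name, (height, width, channels)))
        # 2x2 max pooling with stride 2 halves height and width
        height, width = height // 2, width // 2
        shapes.append((name.replace("conv", "pool"), (height, width, channels)))
    shapes.append(("flatten", (height * width * channels,)))
    shapes.append(("dense1", (1024,)))
    shapes.append(("dense2", (10,)))
    return shapes

for name, shape in trace_shapes():
    print(name, shape)
# input (28, 28, 1)
# conv1 (28, 28, 32)
# pool1 (14, 14, 32)
# conv2 (14, 14, 64)
# pool2 (7, 7, 64)
# flatten (3136,)
# dense1 (1024,)
# dense2 (10,)
```

The flattened vector has 7 * 7 * 64 = 3136 elements, so a different input size or a different number of pooling layers would require a different target shape.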

Origin blog.csdn.net/weixin_43290383/article/details/114964920