Neural Network Basics Study Notes (6): Convolutional Neural Networks

Convolutional Neural Network (CNN)

7.1 Overall structure

In the neural networks introduced earlier, all neurons in adjacent layers are connected; this is called fully-connected.

CNN

7.2 Convolutional layer

What is the problem with the fully connected layer? The problem is that the shape of the data is "ignored". For example, when the input data is an image, the image usually has a 3-dimensional shape along the height, width, and channel directions.

In a CNN, the input and output data of a convolutional layer are sometimes called feature maps. The input data of a convolutional layer is called the input feature map, and its output data is called the output feature map.

7.2.2 Convolution operation

As shown in Figure 7-4, the elements of the filter at each position are multiplied by the corresponding elements of the input and then summed (this calculation is sometimes called a multiply-accumulate operation).

Reference paper:

The specific calculation process:

In a CNN, a bias can also be added to the result of the convolution operation.
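Here is a minimal NumPy sketch of this multiply-accumulate computation with a bias; the 4x4 input, 3x3 filter, and bias value are example values chosen for illustration.

import numpy as np

# a 4x4 input and a 3x3 filter (example values)
x = np.array([[1, 2, 3, 0],
              [0, 1, 2, 3],
              [3, 0, 1, 2],
              [2, 3, 0, 1]])
w = np.array([[2, 0, 1],
              [0, 1, 2],
              [1, 0, 2]])
b = 3  # a single bias value, added to every output element

out_h = x.shape[0] - w.shape[0] + 1   # 2
out_w = x.shape[1] - w.shape[1] + 1   # 2
out = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        # multiply-accumulate: element-wise product of the window and the filter, then sum
        out[i, j] = np.sum(x[i:i+3, j:j+3] * w) + b

print(out)  # [[18. 19.]
            #  [ 9. 18.]]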

7.2.3 Fill

Before the convolutional layer processes the data, it is sometimes necessary to fill fixed values (such as 0) around the input data. This is called padding.

In this example, the padding is set to 1, but the padding value can also be set to any integer such as 2, 3.

In the example in Figure 7-5, if the padding is set to 2, the size of the input data becomes (8, 8); if the padding is set to 3, the size becomes (10, 10).

Padding is used mainly to adjust the size of the output.
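As a quick reference, padding can be sketched with NumPy's np.pad; the (4, 4) input here is a placeholder matching the example sizes above.

import numpy as np

x = np.arange(1, 17).reshape(4, 4)  # a (4, 4) input
print(np.pad(x, pad_width=1, mode='constant').shape)  # (6, 6): a padding of 1 on every side
print(np.pad(x, pad_width=2, mode='constant').shape)  # (8, 8), as in the Figure 7-5 example
print(np.pad(x, pad_width=3, mode='constant').shape)  # (10, 10)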

7.2.4 Stride

The interval at which the filter is applied is called the stride

Assume that the input size is (H, W), the filter size is (FH, FW), the output size is (OH, OW), the padding is P, and the stride is S. The output size can then be calculated with formula (7.1):

OH = (H + 2P - FH) / S + 1
OW = (W + 2P - FW) / S + 1

Since a division is involved, attention must be paid to whether the result is divisible; if it is not, measures such as reporting an error should be taken.

Depending on the deep learning framework, when the value is not divisible, the result is sometimes rounded to the nearest integer and execution continues without raising an error.
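To make the divisibility concern concrete, here is a small helper sketch; the function name conv_output_size and the error handling are just one possible choice (as noted above, a framework may instead round and continue).

def conv_output_size(H, W, FH, FW, P=0, S=1):
    """Output size (OH, OW) of a convolution, following formula (7.1)."""
    oh = (H + 2 * P - FH) / S + 1
    ow = (W + 2 * P - FW) / S + 1
    if not (oh.is_integer() and ow.is_integer()):
        # one possible measure: report an error when the division does not come out even
        raise ValueError("output size is not an integer: ({}, {})".format(oh, ow))
    return int(oh), int(ow)

print(conv_output_size(4, 4, 3, 3, P=0, S=1))  # (2, 2)
print(conv_output_size(7, 7, 3, 3, P=1, S=2))  # (4, 4)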

7.2.5 Convolution operation of 3D data

The previous examples of the convolution operation dealt with 2-dimensional shapes that have only height and width directions.

For images, in addition to the height and width directions, the channel direction must also be handled.

Figure 7-8 shows an example of the convolution operation on 3-dimensional data, and Figure 7-9 shows its calculation order. Here, data with 3 channels is used to show the result of the convolution operation.

The number of channels of the input data and of the filter is the same; here both are 3.

The number of channels of the filter can only be set to the same value as the number of channels of the input data.

7.2.6 Thinking with cubes

When 3-dimensional data is written as a multi-dimensional array, the order is (channel, height, width). For example, data with C channels, height H, and width W is written as (C, H, W). A filter with C channels, filter height FH (Filter Height), and filter width FW (Filter Width) is written as (C, FH, FW).

In this example, the output data is a single feature map, in other words, a feature map with just one channel. So what should be done to obtain multiple feature maps in the channel direction as the output of the convolution operation?

As shown in Figure 7-11, multiple filters are used, so the number of filters must also be considered. The filter weights are therefore written as 4-dimensional data in the order (output_channel, input_channel, height, width).

For example, when there are 20 filters with 3 channels and a size of 5 × 5, they can be written as (20, 3, 5, 5).

The data format (the order of the dimensions) is very important.

If the bias-addition step is further added to the example of Fig. 7-11, the result is as shown in Fig. 7-12 below.
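A quick NumPy sketch of these shapes (the values are random placeholders); following Fig. 7-12, the bias holds one value per output channel, shape (FN, 1, 1), and is added by broadcasting.

import numpy as np

W = np.random.rand(20, 3, 5, 5)  # 20 filters with 3 channels of size 5x5: (FN, C, FH, FW)
b = np.random.rand(20, 1, 1)     # one bias per output channel
out = np.random.rand(20, 8, 8)   # a convolution output of shape (FN, OH, OW), placeholder values
print((out + b).shape)           # (20, 8, 8): the bias is broadcast over each (OH, OW) plane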

7.2.7 Batch processing

Batch processing combines N computations into a single pass: the data flows between the layers as 4-dimensional data in the order (batch_num, channel, height, width).

7.3 Pooling layer

Pooling is an operation that reduces the space in the height and width directions.

Generally, the pooling window size is set to the same value as the stride.
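A minimal sketch of 2x2 max pooling with a stride of 2 on a (4, 4) input (example values chosen for illustration):

import numpy as np

x = np.array([[1, 2, 1, 0],
              [0, 1, 2, 3],
              [3, 0, 1, 2],
              [2, 4, 0, 1]])

# split into non-overlapping 2x2 blocks and take the maximum of each block
out = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(out)  # [[2 3]
            #  [4 2]]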

Characteristics of pooling layer

  • The number of channels does not change

  • Robust against small positional changes in the input

7.4 Implementation of Convolutional Layer and Pooling Layer

7.4.1 4-dimensional array

The data passed between layers in a CNN is 4-dimensional.
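For example, 10 pieces of data with 1 channel and a height and width of 28 can be written as follows (random placeholder values):

import numpy as np

x = np.random.rand(10, 1, 28, 28)  # (N, C, H, W): 10 data, 1 channel, 28x28
print(x.shape)        # (10, 1, 28, 28)
print(x[0].shape)     # (1, 28, 28)   the first piece of data
print(x[0, 0].shape)  # (28, 28)      the first channel of the first piece of data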

7.4.2 Expansion based on im2col

The name im2col is an abbreviation of "image to column", that is, "from an image to a matrix".

Instead of nested for statements, we use the convenient im2col function for a simple implementation.

im2col is a function that expands the input data to fit the filter (weights). As shown in Fig. 7-17, applying im2col converts the 3-dimensional input data into a 2-dimensional matrix (strictly speaking, 4-dimensional data that includes the batch dimension is converted into 2-dimensional data).

im2col expands the input data to fit the filter (weights). Specifically, as shown in Fig. 7-18, each area to which the filter is applied (a 3-dimensional block) is expanded horizontally into a single row. im2col performs this expansion at every location where the filter is applied.

note:

  • For ease of viewing, the stride in the figure is set large so that the filter application areas do not overlap.
  • In an actual convolution operation, the filter application areas almost always overlap.

When the filter application areas overlap, the number of elements after the im2col expansion is larger than the number of elements of the original block.

Therefore, the implementation using im2col has the drawback of consuming more memory than an ordinary implementation.

However, collapsing the computation into one large matrix multiplication is advantageous, because linear algebra libraries are highly optimized for large matrix operations.
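A usage sketch, assuming an im2col(input_data, filter_h, filter_w, stride, pad) helper located in common.util of the same repository as the common.layers module used later in these notes:

import sys, os
sys.path.append(os.pardir)
import numpy as np
from common.util import im2col  # assumed location of the helper

x1 = np.random.rand(1, 3, 7, 7)           # (N, C, H, W)
col1 = im2col(x1, 5, 5, stride=1, pad=0)
print(col1.shape)  # (9, 75): 3x3 filter positions, each expanded to 3*5*5 = 75 elements

x2 = np.random.rand(10, 3, 7, 7)          # a batch of 10
col2 = im2col(x2, 5, 5, stride=1, pad=0)
print(col2.shape)  # (90, 75)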

7.4.3 Implementation of Convolutional Layer

im2col takes the filter size, stride, and padding into account and expands the input data into a 2-dimensional array.

The output size is calculated with formula (7.1) introduced earlier.

The following is the convolutional layer implementation class:
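Below is a forward-only sketch of such a Convolution class, assuming NumPy is imported as np and the im2col helper above is available; the version in common.layers used later also implements backward and stores dW and db.

class Convolution:
    def __init__(self, W, b, stride=1, pad=0):
        self.W = W           # filter weights, shape (FN, C, FH, FW)
        self.b = b           # biases, one per output channel
        self.stride = stride
        self.pad = pad

    def forward(self, x):
        FN, C, FH, FW = self.W.shape
        N, C, H, W = x.shape
        out_h = int(1 + (H + 2*self.pad - FH) / self.stride)
        out_w = int(1 + (W + 2*self.pad - FW) / self.stride)

        col = im2col(x, FH, FW, self.stride, self.pad)  # (N*out_h*out_w, C*FH*FW)
        col_W = self.W.reshape(FN, -1).T                # (C*FH*FW, FN)
        out = np.dot(col, col_W) + self.b               # one large matrix product

        # (N, out_h, out_w, FN) -> (N, FN, out_h, out_w)
        out = out.reshape(N, out_h, out_w, -1).transpose(0, 3, 1, 2)
        return out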

NumPy's transpose function reorders the axes of a multi-dimensional array according to the specified index order:
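For example (shapes chosen for illustration):

import numpy as np

y = np.random.rand(10, 7, 7, 30)      # (N, out_h, out_w, FN)
print(y.transpose(0, 3, 1, 2).shape)  # (10, 30, 7, 7), i.e. (N, FN, out_h, out_w)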

Note:

The filter (weights), bias, stride, and padding are received as arguments at initialization.

The filter is a 4-dimensional array of shape (FN, C, FH, FW), where FN, C, FH, and FW are abbreviations of Filter Number, Channel, Filter Height, and Filter Width, respectively.

7.4.4 Implementation of the pooling layer

In the pooling layer, the expansion is performed independently for each channel, as shown in the following figure:

After the expansion, the required value is taken from each row of the expanded matrix; for max pooling, the max function is used. The result is then converted back to a suitable output shape with reshape, as shown in the figure:

Pooling layer implementation class:
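Below is a forward-only sketch of such a Pooling class (max pooling), again assuming NumPy as np and the im2col helper above; its three steps match the summary that follows, and the version in common.layers also implements backward.

class Pooling:
    def __init__(self, pool_h, pool_w, stride=1, pad=0):
        self.pool_h = pool_h
        self.pool_w = pool_w
        self.stride = stride
        self.pad = pad

    def forward(self, x):
        N, C, H, W = x.shape
        out_h = int(1 + (H - self.pool_h) / self.stride)
        out_w = int(1 + (W - self.pool_w) / self.stride)

        # 1. expand the input data
        col = im2col(x, self.pool_h, self.pool_w, self.stride, self.pad)
        col = col.reshape(-1, self.pool_h * self.pool_w)

        # 2. take the maximum of each row
        out = np.max(col, axis=1)

        # 3. convert to a suitable output shape
        out = out.reshape(N, out_h, out_w, C).transpose(0, 3, 1, 2)
        return out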

In summary, the pooling layer is implemented in 3 steps:

1. Expand input data

2. Find the maximum value of each row

3. Convert to a suitable output size

7.5 Implementation of CNN

We have implemented the convolutional layer and the pooling layer, now let's combine these layers.

We build a CNN for handwritten digit recognition. Here we implement the CNN shown in the figure, SimpleConvNet, with the structure conv - relu - pool - affine - relu - affine - softmax.

Initialization of SimpleConvNet (__init__)

Implementation code:

# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # setting for importing files from the parent directory
import pickle
import numpy as np
from collections import OrderedDict
from common.layers import *
from common.gradient import numerical_gradient


class SimpleConvNet:
    """简单的ConvNet

    conv - relu - pool - affine - relu - affine - softmax
    
    Parameters
    ----------
    input_size : 输入大小(MNIST的情况下为784)
    hidden_size_list : 隐藏层的神经元数量的列表(e.g. [100, 100, 100])
    output_size : 输出大小(MNIST的情况下为10)
    activation : 'relu' or 'sigmoid'
    weight_init_std : 指定权重的标准差(e.g. 0.01)
        指定'relu'或'he'的情况下设定“He的初始值”
        指定'sigmoid'或'xavier'的情况下设定“Xavier的初始值”
    """
    def __init__(self, input_dim=(1, 28, 28), 
                 conv_param={'filter_num':30, 'filter_size':5, 'pad':0, 'stride':1},
                 hidden_size=100, output_size=10, weight_init_std=0.01):
        filter_num = conv_param['filter_num']
        filter_size = conv_param['filter_size']
        filter_pad = conv_param['pad']
        filter_stride = conv_param['stride']
        input_size = input_dim[1]
        conv_output_size = (input_size - filter_size + 2*filter_pad) / filter_stride + 1
        pool_output_size = int(filter_num * (conv_output_size/2) * (conv_output_size/2))

        # initialize the weights
        self.params = {}
        self.params['W1'] = weight_init_std * \
                            np.random.randn(filter_num, input_dim[0], filter_size, filter_size)
        self.params['b1'] = np.zeros(filter_num)
        self.params['W2'] = weight_init_std * \
                            np.random.randn(pool_output_size, hidden_size)
        self.params['b2'] = np.zeros(hidden_size)
        self.params['W3'] = weight_init_std * \
                            np.random.randn(hidden_size, output_size)
        self.params['b3'] = np.zeros(output_size)

        # create the layers
        self.layers = OrderedDict()
        self.layers['Conv1'] = Convolution(self.params['W1'], self.params['b1'],
                                           conv_param['stride'], conv_param['pad'])
        self.layers['Relu1'] = Relu()
        self.layers['Pool1'] = Pooling(pool_h=2, pool_w=2, stride=2)
        self.layers['Affine1'] = Affine(self.params['W2'], self.params['b2'])
        self.layers['Relu2'] = Relu()
        self.layers['Affine2'] = Affine(self.params['W3'], self.params['b3'])
        
        # only the SoftmaxWithLoss layer is kept in the separate variable last_layer
        self.last_layer = SoftmaxWithLoss()

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)

        return x

    def loss(self, x, t):
        """求损失函数
        参数x是输入数据、t是教师标签
        """
        y = self.predict(x)
        return self.last_layer.forward(y, t)

    def accuracy(self, x, t, batch_size=100):
        if t.ndim != 1 : t = np.argmax(t, axis=1)
        
        acc = 0.0
        
        for i in range(int(x.shape[0] / batch_size)):
            tx = x[i*batch_size:(i+1)*batch_size]
            tt = t[i*batch_size:(i+1)*batch_size]
            y = self.predict(tx)
            y = np.argmax(y, axis=1)
            acc += np.sum(y == tt) 
        
        return acc / x.shape[0]

    def numerical_gradient(self, x, t):
        """求梯度(数值微分)

        Parameters
        ----------
        x : 输入数据
        t : 教师标签

        Returns
        -------
        具有各层的梯度的字典变量
            grads['W1']、grads['W2']、...是各层的权重
            grads['b1']、grads['b2']、...是各层的偏置
        """
        loss_w = lambda w: self.loss(x, t)

        grads = {}
        for idx in (1, 2, 3):
            grads['W' + str(idx)] = numerical_gradient(loss_w, self.params['W' + str(idx)])
            grads['b' + str(idx)] = numerical_gradient(loss_w, self.params['b' + str(idx)])

        return grads

    def gradient(self, x, t):
        """求梯度(误差反向传播法)

        Parameters
        ----------
        x : 输入数据
        t : 教师标签

        Returns
        -------
        具有各层的梯度的字典变量
            grads['W1']、grads['W2']、...是各层的权重
            grads['b1']、grads['b2']、...是各层的偏置
        """
        # forward
        self.loss(x, t)

        # backward
        dout = 1
        dout = self.last_layer.backward(dout)

        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)

        # store the gradients of each layer
        grads = {}
        grads['W1'], grads['b1'] = self.layers['Conv1'].dW, self.layers['Conv1'].db
        grads['W2'], grads['b2'] = self.layers['Affine1'].dW, self.layers['Affine1'].db
        grads['W3'], grads['b3'] = self.layers['Affine2'].dW, self.layers['Affine2'].db

        return grads
        
    def save_params(self, file_name="params.pkl"):
        params = {}
        for key, val in self.params.items():
            params[key] = val
        with open(file_name, 'wb') as f:
            pickle.dump(params, f)

    def load_params(self, file_name="params.pkl"):
        with open(file_name, 'rb') as f:
            params = pickle.load(f)
        for key, val in params.items():
            self.params[key] = val

        for i, key in enumerate(['Conv1', 'Affine1', 'Affine2']):
            self.layers[key].W = self.params['W' + str(i+1)]
            self.layers[key].b = self.params['b' + str(i+1)]

test acc:0.9892

note:

The weights of the first layer (the convolutional layer) are saved under the key W1 and its bias under the key b1. Similarly, the keys W2, b2 and W3, b3 hold the weights and biases of the two fully connected layers.

Prediction and loss function

Gradients are computed with the error backpropagation method.

The training part of the implementation is the same as before:

# the two imports below assume the same repository layout as above (dataset/ and common/)
import sys, os
sys.path.append(os.pardir)
from dataset.mnist import load_mnist
from common.trainer import Trainer

# load the data
(x_train, t_train), (x_test, t_test) = load_mnist(flatten=False)

# reduce the amount of data if processing takes too long
#x_train, t_train = x_train[:5000], t_train[:5000]
#x_test, t_test = x_test[:1000], t_test[:1000]

max_epochs = 20

network = SimpleConvNet(input_dim=(1,28,28), 
                        conv_param = {'filter_num': 30, 'filter_size': 5, 'pad': 0, 'stride': 1},
                        hidden_size=100, output_size=10, weight_init_std=0.01)
                        
trainer = Trainer(network, x_train, t_train, x_test, t_test,
                  epochs=max_epochs, mini_batch_size=100,
                  optimizer='Adam', optimizer_param={'lr': 0.001},
                  evaluate_sample_num_per_epoch=1000)
trainer.train()

# save the parameters
network.save_params("params.pkl")
print("Saved Network Parameters!")

7.6 Visualization of CNN

7.6.1 Visualization of layer 1 weights

We display the filters of the convolutional layer (layer 1) as images.

Before learning, the filters are initialized randomly, so there is no regularity in their black-and-white shading; after learning, however, they become regular images, such as filters that respond to edges (boundaries between light and dark) and blobs (locally clumped regions).

7.6.2 Information extraction based on hierarchical structure

As the layers get deeper, the extracted information (more precisely, the information that the neurons respond strongly to) becomes increasingly abstract.

7.7 Representative CNN

7.7.1 LeNet

LeNet has consecutive convolutional layers and pooling layers (strictly speaking, subsampling layers that merely "thin out" elements), and finally outputs the result through fully connected layers.

Compared with "current CNN", LeNet has several differences:

  • The first difference is the activation function: LeNet uses the sigmoid function, whereas current CNNs mainly use the ReLU function.

  • The original LeNet uses subsampling to shrink the intermediate data, whereas max pooling is the mainstream in current CNNs.

Although LeNet differs somewhat from current CNNs, the differences are not that large, which is remarkable considering how old LeNet is (it was proposed in 1998).

7.7.2 AlexNet

Its network structure is basically the same as LeNet's: AlexNet stacks convolutional layers and pooling layers, and finally outputs the result through fully connected layers.

AlexNet is not greatly different from LeNet, but it has the following differences:

  • The activation function is ReLU.
  • It uses layers that perform local response normalization (LRN).
  • It uses Dropout.

Summary

  • CNN adds a convolutional layer and a pooling layer to the previous fully connected layer network.
  • Using the im2col function, the convolutional layer and the pooling layer can be implemented simply and efficiently.
  • Visualizing a CNN shows that, as the layers get deeper, the extracted information becomes more advanced.
  • LeNet and AlexNet are representative networks of CNN.
  • In the development of deep learning, big data and GPU have made great contributions.

Conv - ReLU - Pool - Conv - ReLU - Pool - Conv - ReLU - Affine - ReLU - Affine - Softmax


Origin blog.csdn.net/qq_37457202/article/details/107663421