Neural Networks and Deep Learning: CNN

1. Traditional artificial neural network (ANN)

A traditional artificial neural network consists of three kinds of layers: an input layer, hidden layers, and an output layer. Each layer is composed of individual neurons, and each connection between neurons carries its own weight, whose value must be learned through training.
Except for the input layer, every node applies a nonlinear transformation to its input; this nonlinearity is provided by an activation function.

The problem with stacking many layers is that model complexity becomes too high and leads to overfitting, so regularization is required.
The weights are trained with the backpropagation (BP) algorithm; its detailed steps are omitted here.
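For illustration, here is a minimal numpy sketch of the forward pass through one hidden layer; the layer sizes and the sigmoid activation are just assumptions for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 784 inputs (a flattened 28x28 image), 15 hidden neurons, 10 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((15, 784)) * 0.01, np.zeros(15)
W2, b2 = rng.standard_normal((10, 15)) * 0.01, np.zeros(10)

x = rng.random(784)            # a flattened input image
h = sigmoid(W1 @ x + b1)       # hidden layer: linear map + nonlinear activation
y = sigmoid(W2 @ h + b2)       # output layer
print(y.shape)                 # (10,)
```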

2. CNN

In a fully connected artificial neural network, every neuron in one layer is connected by an edge to every neuron in the adjacent layers. When the feature dimension of the input becomes very high, the number of parameters a fully connected network has to train grows enormously and computation becomes very slow. For example, for a black-and-white 28×28 handwritten digit image, the input layer already has 784 neurons.

If only a single hidden layer of 15 neurons is used, the weight matrix w already has more than 784×15 = 11760 entries; if the input is a 28×28 handwritten digit image in color RGB format, the input layer has 28×28×3 = 2352 neurons. It is easy to see that using a fully connected neural network on images requires far too many training parameters.
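The parameter counts above can be checked with a couple of lines (assuming the same 15-neuron hidden layer; biases are not counted):

```python
# Weights of a fully connected layer: number of inputs x number of hidden neurons.
grayscale_inputs = 28 * 28            # 784 input neurons for a 28x28 grayscale image
print(grayscale_inputs * 15)          # 11760 weights for a 15-neuron hidden layer

rgb_inputs = 28 * 28 * 3              # 2352 input neurons for a 28x28 RGB image
print(rgb_inputs * 15)                # 35280 weights -- the count grows quickly
```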

In a Convolutional Neural Network (CNN), the neurons of a convolutional layer are connected only to some of the neurons of the previous layer, i.e., the layers are not fully connected, and within the same layer certain neurons share the same connection weights w and biases b. This greatly reduces the number of parameters that need to be trained.

1. CNN layers

The structure of a convolutional neural network (CNN) generally includes these layers:

  • Input layer: for data input
  • Convolution layer: uses convolution kernels for feature extraction and feature mapping
  • Excitation layer: since convolution is still a linear operation, a nonlinear mapping must be added
  • Pooling layer: performs down-sampling, sparsifies the feature maps, and reduces the amount of computation (dimensionality reduction, helps prevent overfitting)
  • Fully connected layer: usually refits the features at the end of the CNN to reduce the loss of feature information
  • Output layer: used to output the results

Of course, some other functional layers can also be used in between:

  • Normalization layer (Batch Normalization): normalizes the features inside the CNN
  • Slicing layer: learns different regions of the (image) data separately
  • Fusion layer: fuses branches that performed feature learning independently

1.1 Input layer

The input layer of a CNN preserves the structure of the image itself. For a 32×32 image in RGB format, the CNN input consists of 3×32×32 neurons.
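For illustration, such an input can be represented as a 3×32×32 array (random values here, just to show the layout):

```python
import numpy as np

# CNN input for a 32x32 RGB image: 3 channels x 32 rows x 32 columns.
image = np.random.rand(3, 32, 32)
print(image.shape)   # (3, 32, 32) -> 3*32*32 = 3072 input neurons
```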

1.2 Convolution layer

Several concepts need to be clarified:

  • Receptive field (local receptive field)
    Each neuron in a hidden layer of a convolutional neural network sees only a small local region of the previous layer, its receptive field. The other features of the previous layer are covered by translating the receptive field across the other neurons of the same layer.

  • Convolution kernel
    The weight matrix applied within the receptive field.

  • Shared weights
    All neurons of the same feature map use the same convolution kernel, i.e., they share the same weights and bias.

  • Stride
    The interval at which the receptive field scans across the input is called the stride.

  • Boundary expansion (pad)
    When the stride is relatively large (stride > 1), the receptive field may run "out of bounds" while scanning features near the edge; in that case the boundary must be expanded (padded).


  • Feature map
    The matrix of next-layer neurons generated by scanning the input with the receptive field and a convolution kernel is called a feature map.

[Figure: convolution example — 5×5×3 input volume, two 3×3×3 filters W0/W1 with biases b0/b1, 3×3×2 output volume]
The example given here is an input image (5×5×3) and convolution kernels (3×3×3), of which there are two (Filter W0 and W1), with two biases (b0 and b1); the convolution result is an Output Volume of 3×3×2, with stride = 2.

The input becomes 7×7×3 because pad = 1 (one row and one column of zeros is added on each side of the image border). For color images there are generally three colors, RGB, known as 3 channels; 7×7 refers to the image height h × width w. The purpose of zero padding is to allow the features at the image boundary to be extracted.

Why is the convolution kernel depth set to 3? Because the input has 3 channels, the kernel depth must match the input depth. The kernel width w and height h can be chosen, but the width and height must be equal.

The convolution kernel output is o[0,0,0] = 3 (the value in the light green box under the Output Volume). How is this result obtained? The key is an element-wise multiply-and-add of the corresponding matrix positions (do not confuse it with matrix multiplication):

=> w0[:,:,0] * x[:,:,0] blue-area matrix (R channel) + w0[:,:,1] * x[:,:,1] blue-area matrix (G channel) + w0[:,:,2] * x[:,:,2] blue-area matrix (B channel) + b0 (do not forget it, because y = w * x + b)

First term => 0*1 + 0*1 + 0*1 + 0*(-1) + 1*(-1) + 1*0 + 0*(-1) + 1*1 + 1*0 = 0

Second term => 0*(-1) + 0*(-1) + 0*1 + 0*(-1) + 0*1 + 1*0 + 0*(-1) + 2*1 + 2*0 = 2

Third term => 0*1 + 0*0 + 0*(-1) + 0*0 + 2*0 + 2*0 + 0*1 + 0*(-1) + 0*(-1) = 0

Convolution kernel output o[0,0,0] => first term + second term + third term + b0 = 0 + 2 + 0 + 1 = 3

How is o[0,0,1] = -5 obtained?

Because stride = 2 here, the input window slides two steps, which is the area of the red box; the calculation is the same as before.

First term => 0*1 + 0*1 + 0*1 + 1*(-1) + 2*(-1) + 2*0 + 1*(-1) + 1*1 + 2*0 = -3

Second term => 0*(-1) + 0*(-1) + 0*1 + 1*(-1) + 2*1 + 0*0 + 2*(-1) + 1*1 + 1*0 = 0

Third term => 0*1 + 0*0 + 0*(-1) + 2*0 + 0*0 + 1*0 + 0*1 + 2*(-1) + 1*(-1) = -3

Convolution kernel output o[0,0,1] => first term + second term + third term + b0 = (-3) + 0 + (-3) + 1 = -5

The kernel window then keeps sliding over the input image in this way and the convolution results are produced; because there are two convolution kernels, there are two output maps.

A question may arise here: how is the size of the output window obtained?

Here is the formula: output width w = (input width − kernel width + 2 × pad) / stride + 1, and the output height h is computed the same way from the heights (everything here is square, so h = w).

Taking the example above, the output width w = (5 − 3 + 2×1)/2 + 1 = 3, so the output window is 3×3; since there are 2 kernels there are 2 outputs, giving 3×3×2.
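For illustration, here is a minimal numpy sketch of this convolution (pad = 1, stride = 2, two 3×3×3 kernels). The input values are random, so the numbers differ from the worked example above, but the output shape is the same 3×3×2:

```python
import numpy as np

def conv2d(x, kernels, biases, stride=1, pad=0):
    """Naive convolution. x: (C, H, W); kernels: (K, C, kh, kw); biases: (K,)."""
    c, h, w = x.shape
    k, _, kh, kw = kernels.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))          # zero-pad the borders
    h_out = (h - kh + 2 * pad) // stride + 1                  # output-size formula
    w_out = (w - kw + 2 * pad) // stride + 1
    out = np.zeros((k, h_out, w_out))
    for f in range(k):
        for i in range(h_out):
            for j in range(w_out):
                window = xp[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
                out[f, i, j] = np.sum(window * kernels[f]) + biases[f]   # multiply-and-add
    return out

x = np.random.rand(3, 5, 5)                     # 5x5 RGB input
W = np.random.randn(2, 3, 3, 3)                 # two 3x3x3 kernels (Filter W0, W1)
b = np.array([1.0, 0.0])                        # biases b0, b1
print(conv2d(x, W, b, stride=2, pad=1).shape)   # (2, 3, 3) -> the 3x3x2 output volume
```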

Each receptive field has a convolution kernel: we call the weight matrix w applied within the receptive field the convolution kernel. The interval at which the receptive field scans across the input is called the stride. When the stride is relatively large (stride > 1), the receptive field may run "out of bounds" while scanning features near the edge; in this case boundary expansion (pad) is needed, and the padded border can be filled with 0 or another value. The stride and the padding values are defined by the user.

The size of the convolution kernel, i.e., the size of the receptive field, is defined by the user; the values of the kernel's weight matrix are the parameters of the convolutional neural network. To include an offset term, a bias b can be attached to the convolution kernel. The initial values of the weights and bias can be generated randomly and are then changed through training.

We call the matrix of next-layer neurons produced by scanning the input with the receptive field and a convolution kernel a feature map. The convolution kernels used by the neurons on the same feature map are identical, so these neurons share weights: they share the weights in the convolution kernel as well as the accompanying bias. One feature map corresponds to one convolution kernel; if we use 3 different convolution kernels, we obtain 3 output feature maps (receptive field: 5×5, stride: 1).

1.3 Excitation layer

The excitation layer applies a nonlinear mapping to the output of the convolutional layer, because the computation in the convolutional layer is still linear. The activation function used is generally the ReLU function.

  • Why use the ReLU function?

From y = w * x + b it can be seen that without an activation function the output of every network layer is linear, whereas the real-world data we deal with mostly follows a variety of nonlinear distributions.

This shows that the role of the activation function is to turn the linear mapping into a nonlinear one, which brings the model closer to real scenarios.

  • Why use the ReLU function instead of the sigmoid function?

When x becomes very large in magnitude, the outputs of sigmoid and tanh saturate to a constant value. Computing the gradient requires the first-order derivative of the activation function, and for both sigmoid and tanh that derivative approaches 0 in the saturated regions; this is the so-called vanishing-gradient problem, and it ultimately prevents the weight parameters w and b from being updated. ReLU does not have this problem; moreover, for x > 0 the derivative of ReLU is 1, which greatly simplifies the computation of dw and db in backpropagation.

Using sigmoid can also lead to the gradient-explosion problem: over many iterations of forward and backward propagation, because sigmoid involves an exponential, some values in the result can accumulate and grow exponentially across iterations, eventually leading to NaNs and overflow.
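A small numpy check of the saturation argument (values are illustrative only): the sigmoid derivative collapses toward 0 for large |x|, while the ReLU derivative stays at 1 for x > 0:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # at most 0.25, and ~0 for large |x|

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (x > 0).astype(float)  # exactly 1 wherever x > 0

x = np.array([-10.0, -1.0, 0.5, 10.0])
print(d_sigmoid(x))   # approx [0.000045, 0.1966, 0.2350, 0.000045]
print(d_relu(x))      # [0., 0., 1., 1.]
```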

1.4 Pooling layer


The pooling layer generally comes after convolution + ReLU, and its functions are:

1. Reduce the size of the input matrix (only the width and height, not the depth) while keeping the main features. (Admittedly, pooling causes some loss of feature information, which is why some classic models have dropped the pooling layer.) The purpose is obvious: to reduce the amount of computation in the subsequent operations.

2. Mean pooling (mean_pooling) and max pooling (max_pooling) are generally used. Pooling gives the features a certain invariance to translation and rotation of the input.

Mean pooling computes the average value of each pooling region of the input matrix. Note that the step at which the pooling window slides over the input is the stride; generally stride = 2. Max pooling takes the maximum value of each pooling region and places it in the corresponding position of the output.
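For illustration, a minimal numpy sketch of 2×2 max pooling with stride 2 (mean pooling would simply replace max with the mean):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """x: (H, W) feature map; returns the max of each size x size region."""
    h, w = x.shape
    h_out, w_out = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 7, 8],
                 [3, 2, 1, 0],
                 [1, 2, 3, 4]], dtype=float)
print(max_pool(fmap))   # [[6. 8.]
                        #  [3. 4.]]
```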

1.5 Fully connected layer

The fully connected layer mainly refits the features to reduce the loss of feature information; the output layer then produces the final target result. The VGG architecture, for example, ends with such fully connected layers (structure diagram omitted here).
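As a rough sketch of what the fully connected stage does (the sizes are illustrative, not VGG's actual ones): the last feature maps are flattened into one vector and multiplied by a weight matrix:

```python
import numpy as np

feature_maps = np.random.rand(16, 5, 5)        # 16 feature maps of size 5x5 (illustrative)
x = feature_maps.reshape(-1)                   # flatten to a 400-dimensional vector
W = np.random.randn(10, x.size) * 0.01         # fully connected weights to 10 outputs
b = np.zeros(10)
scores = W @ x + b                             # final output scores
print(scores.shape)                            # (10,)
```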

1.6 Normalization layer

  • Batch Normalization

Batch Normalization performs a preprocessing operation in the middle of the network: the output of one layer is normalized before it enters the next layer. This effectively prevents "gradient dispersion" (vanishing gradients) and speeds up network training.
The Batch Normalization procedure works as follows:
During each training step, a batch of batch_size samples is used. In the BN layer, each neuron is treated as one feature; along a given feature dimension the batch_size samples provide batch_size values. The mean and variance of these samples are computed for each neuron dimension xi, the normalized value x̂i is obtained from the formula, and a linear mapping with the parameters γ and β then produces the corresponding output yi for each neuron. In the BN layer, each neuron dimension therefore has its own parameters γ and β, which can be optimized through training just like the weights w.

When batch normalization is used in a convolutional neural network, the feature maps are generally batch-normalized before the ReLU activation, and the normalized output is then used as the input of the excitation layer; this helps keep the activation function operating where its partial derivative is useful.

One approach is to treat every neuron in a feature map as a separate feature dimension; the total number of parameters γ and β is then 2×fmapwidth×fmaplength×fmapnum, which is a very large number of parameters.

The other approach is to treat an entire feature map as one feature dimension: all neurons on a feature map share that map's parameters γ and β, so the total number of γ and β parameters is 2×fmapnum, and the mean and variance are computed per feature map over the batch_size training samples.

Note: fmapnum refers to the number of feature maps of one sample; feature maps have a fixed order, just like neurons.

The difference between the training process and the testing process of the Batch Normalization algorithm:

During training, batch_size training samples are fed into the CNN at a time, so the mean and variance needed to compute the output can be obtained directly in the BN layer;

During testing, we often feed only a single test sample into the CNN, which means the batch statistics computed in the BN layer are meaningless (with only one sample the variance is 0), and the output of the BN layer, and hence of the CNN, would be wrong. Therefore, during testing we use the mean and variance of each dimension computed over all samples of the training set in the BN layer. For convenience, the per-batch means and variances of each dimension can be accumulated over the batch_num normalization steps and averaged at the end.
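For illustration, a minimal numpy sketch of BN that treats each neuron as one feature dimension; running averages stand in for the training-set statistics used at test time (the momentum value is an arbitrary choice):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, running_mean, running_var, momentum=0.9, eps=1e-5):
    """x: (batch_size, num_features). Normalize each feature over the batch."""
    mu = x.mean(axis=0)                         # per-feature batch mean
    var = x.var(axis=0)                         # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)       # normalized activations
    # Keep running statistics for use at test time.
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return gamma * x_hat + beta, running_mean, running_var

def batch_norm_test(x, gamma, beta, running_mean, running_var, eps=1e-5):
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

d = 4                                           # number of neurons (features)
gamma, beta = np.ones(d), np.zeros(d)
rm, rv = np.zeros(d), np.ones(d)
batch = np.random.randn(32, d)                  # batch_size = 32
y, rm, rv = batch_norm_train(batch, gamma, beta, rm, rv)
print(batch_norm_test(np.random.randn(1, d), gamma, beta, rm, rv).shape)   # (1, 4)
```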

  • Local Response Normalization

Local Response Normalization (LRN) normalizes across the outputs of adjacent convolution kernels (after ReLU); in other words, its input is the set of feature maps produced after ReLU.

The formula of LRN is as follows:

b(i,x,y) = a(i,x,y) / ( k + α · Σ_{j = max(0, i−n/2)}^{min(N−1, i+n/2)} a(j,x,y)² )^β

where:
a(i,x,y) is the value at position (x, y) on the feature map output by the i-th convolution kernel (after the ReLU layer).
b(i,x,y) is the output of a(i,x,y) after LRN.
N is the number of convolution kernels, i.e., the number of input feature maps.
n is the number of neighbouring convolution kernels (or feature maps) included in the sum, chosen by the user.
k, α, β are hyperparameters, adjusted or fixed by the user.
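For illustration, a minimal numpy sketch of the LRN formula above; the hyperparameter values are only examples (AlexNet used k = 2, n = 5, α = 1e-4, β = 0.75):

```python
import numpy as np

def lrn(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """a: (N, H, W) -- N feature maps (one per convolution kernel), after ReLU."""
    N = a.shape[0]
    b = np.zeros_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)   # neighbouring maps
        s = np.sum(a[lo:hi + 1] ** 2, axis=0)                 # sum of squares over neighbours
        b[i] = a[i] / (k + alpha * s) ** beta
    return b

a = np.abs(np.random.randn(8, 6, 6))   # 8 feature maps after ReLU (non-negative)
print(lrn(a).shape)                    # (8, 6, 6)
```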

The differences from BN: BN is computed from the data of a mini-batch, while the neighbourhood used by LRN is simply chosen by the user; BN has learnable parameters during training, while LRN does not; and BN normalization happens across different samples, whereas LRN normalization happens across the outputs of different convolution kernels within one sample.

2. CNN application scenarios

Convolutional neural networks have a wide range of applications, falling into two main categories: data prediction and image processing. Data prediction needs no further explanation; image processing mainly covers image classification, detection, recognition, and segmentation.

  • Image classification: scene classification, object classification

  • Image detection: saliency detection, object detection, semantic detection, etc.

  • Image recognition: face recognition, character recognition, license plate recognition, behavior recognition, gait recognition, etc.

  • Image segmentation: foreground segmentation, semantic segmentation


Original post: blog.csdn.net/Mason_Chen/article/details/110305156