From neuron to neural network

Neurons

In simple terms: given the weights w1, w2 and bias b as parameters, and input data x1, x2, the neuron first computes the linear combination z = w1·x1 + w2·x2 + b, then applies the nonlinear change σ to get a = σ(z), and finally the loss function L(a, y) is computed.
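A minimal sketch of this computation in Python, with hypothetical weights and inputs; the activation σ is assumed here to be the sigmoid, and the loss is assumed to be squared error:

```python
import math

# hypothetical parameters and input (not from the original figure)
w1, w2 = 0.5, -0.3   # weights
b = 0.1              # bias
x1, x2 = 1.0, 2.0    # inputs
y = 1.0              # target label

z = w1 * x1 + w2 * x2 + b      # linear combination
a = 1 / (1 + math.exp(-z))     # sigma: sigmoid activation
L = (a - y) ** 2               # loss L(a, y), here squared error

print(z, a, L)
```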
Among them, there are three common choices of the activation σ:

Sigmoid

f(x) = 1 / (1 + e^(−x)); the value range of f(x) is (0, 1).

Tanh

f(x) = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)); the value range is (−1, 1). Unlike Sigmoid, it is zero-centered and can output negative values.

ReLU

f(x) = max(0, x); the value range is [0, +∞).
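The three activations above can be sketched directly in Python to check their output ranges:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))   # output in (0, 1)

def tanh(x):
    return math.tanh(x)             # output in (-1, 1), zero-centered

def relu(x):
    return max(0.0, x)              # output in [0, +inf)

for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x))
```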

Loss function

The loss function measures how far the prediction a is from the target y; what we want is for this loss to be as small as possible.

L1


L2


Cross entropy

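The formula images are missing here, so as a hedged reconstruction, these are the standard single-output forms of the three losses (a is the prediction, y the target; cross-entropy assumes a is a probability and y ∈ {0, 1}):

```python
import math

def l1_loss(a, y):
    return abs(a - y)        # L1: absolute error

def l2_loss(a, y):
    return (a - y) ** 2      # L2: squared error

def cross_entropy(a, y):
    # binary cross-entropy: a in (0, 1) is the predicted probability
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

print(l1_loss(0.8, 1.0), l2_loss(0.8, 1.0), cross_entropy(0.5, 1.0))
```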

Gradient descent

The update rule is W ← W − α · ∂L/∂W, where α is the learning rate. We want the loss L to be as small as possible. If the derivative ∂L/∂W is positive, L grows as W grows, so we should decrease W; if it is negative, we should increase W. Subtracting the derivative does exactly the right thing in both cases. (This explains the seemingly awkward minus sign in gradient descent.)

Simply put, move in the direction opposite to the gradient. A function graph makes this intuitive; taking a two-dimensional graph as an example, just walk downhill into the recess (the valley).
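A tiny numerical illustration, using a hypothetical loss L(W) = (W − 3)² whose minimum is at W = 3: subtracting the gradient really does walk W downhill.

```python
def grad(W):
    # derivative of the toy loss L(W) = (W - 3)^2
    return 2 * (W - 3)

W = 0.0      # initial weight
lr = 0.1     # learning rate
for _ in range(100):
    W -= lr * grad(W)   # the minus sign steps against the gradient

print(W)  # converges toward 3
```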

Backpropagation

Backpropagation applies the chain rule to compute the derivative of the loss with respect to every weight in the network; after a large number of such gradient-descent updates, an approximately optimal (in practice usually locally optimal) solution is obtained.

Convolutional Neural Network


Convolutional layer: edge detection

Input data


Convolution operator

The values of the convolution operator are adjusted continuously during training; each value is equivalent to a neuron, and the value is the weight W.

Convolution operation

The convolution operator is a small matrix. Place the operator on the input data and take each value of the operator as the weight of the value at the same position in the input data, computing a weighted sum. In this example, compute 1·1 + 2·0 + 0·0 + 0·1 = 1, store the result 1 in another matrix, then move the convolution operator to the next position and continue the calculation until all the input data has been traversed. Here the default stride, i.e. the distance the operator moves each time, is 1, and the final convolution result is a 3 × 3 matrix.
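The sliding-window computation above can be sketched in plain Python. The full 4 × 4 input here is hypothetical; only its top-left window and the kernel are taken from the worked example:

```python
def conv2d(inp, kernel, stride=1):
    # valid convolution (no padding): slide the kernel, take a weighted sum
    f = len(kernel)
    n = len(inp)
    out_n = (n - f) // stride + 1
    out = []
    for i in range(out_n):
        row = []
        for j in range(out_n):
            acc = 0
            for a in range(f):
                for b in range(f):
                    acc += inp[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out

# hypothetical 4x4 input; top-left window matches the example above
inp = [
    [1, 2, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 2],
    [0, 1, 1, 0],
]
kernel = [[1, 0],
          [0, 1]]
print(conv2d(inp, kernel))  # 3x3 result; top-left entry is 1*1 + 2*0 + 0*0 + 0*1 = 1
```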

Advanced

The convolution operator can also have many, many layers (channels), and then the obtained convolution result can have many, many layers as well. Each layer of the operator does different work; I understand them as detecting different features of the picture. After such a convolution, an input whose size was M × N × 3 can become very deep: the last number may become 32, 64 or the like. Note that the input data can also have many layers; once you understand how single-layer data changes, just superimpose multiple layers of data.

Convolution stride and padding

Stride

The stride is the distance the convolution operator moves each time. If stride = 2, the operator jumps two positions per move: first the top-left position, then two columns to the right, then two rows down, and so on. Because fewer windows fit, the final result is a 2 × 2 matrix.

padding

To put it simply, without padding the convolution output shrinks and becomes too small, so a ring of zeros is filled in around the input before convolving. Assuming padding = 1, one circle of zeros is added around the input, which keeps the output from shrinking.

Convolution result size

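The formula image is missing here, so as a reconstruction of the standard result: for input size n, operator size f, padding p and stride s, the output size is ⌊(n + 2p − f) / s⌋ + 1. A quick check (the 4 × 4 input and 2 × 2 operator sizes are assumed from the earlier example):

```python
def conv_output_size(n, f, p=0, s=1):
    # standard formula: floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

print(conv_output_size(4, 2))        # 3 -> the 3x3 result with stride 1
print(conv_output_size(4, 2, s=2))   # 2 -> the 2x2 result with stride 2
print(conv_output_size(5, 3, p=1))   # 5 -> "same" padding keeps the size
```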

Pooling layer

The pooling operator slides over the input just like a convolution operator, but it has no learned weights, and there are only two common types: taking the maximum value (max pooling) and taking the average (average pooling).
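A minimal sketch of both pooling types, using non-overlapping 2 × 2 windows and a hypothetical input:

```python
def pool2d(inp, f, mode="max"):
    # non-overlapping pooling: window f x f, stride f
    n = len(inp)
    out = []
    for i in range(0, n - f + 1, f):
        row = []
        for j in range(0, n - f + 1, f):
            window = [inp[i + a][j + b] for a in range(f) for b in range(f)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out

inp = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 2],
    [2, 0, 1, 3],
]
print(pool2d(inp, 2, "max"))  # [[4, 2], [2, 5]]
print(pool2d(inp, 2, "avg"))  # [[2.5, 1.0], [0.75, 2.75]]
```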

Fully connected layer

It mainly does the work of the classifier; common alternatives for this role are SVM, FCN, and global pooling. It sits at the end of the model and is connected to the output. The output information is the predicted category: the model reports the most likely category, such as a cat, with a probability of 88%.

Classical convolutional neural network structure

AlexNet

The network that made convolutional neural networks famous in one stroke.


As more and more hidden layers are added, the classification accuracy drops instead. This phenomenon is mostly caused by vanishing or exploding gradients. For example, if a factor of 2 is multiplied in at each of 10 layers, the gradient grows to 2^10 = 1024 (exploding gradient); if the factor is very small, say 0.5, after ten layers it shrinks to 0.5^10 ≈ 0.00098 (vanishing gradient).
An inappropriate activation function or too-large initial weights can also cause the gradient to vanish.
As a reminder, the causes of the two problems are not identical.
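The arithmetic can be checked directly: multiplying a gradient by the same per-layer factor over 10 layers gives

```python
layers = 10
print(2.0 ** layers)   # 1024.0 -- the gradient explodes
print(0.5 ** layers)   # 0.0009765625 -- the gradient vanishes
```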

ResNet

ResNet is a residual neural network, a model born specifically to solve the vanishing- and exploding-gradient problems above.
Its main innovation is to superimpose the original input data and the transformed data together and pass the sum on. This skip connection counteracts the vanishing gradient well, so the neural network can be made very, very deep.
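A minimal sketch of the skip connection, with a hypothetical inner transform F standing in for the block's convolution layers:

```python
def F(x):
    # stand-in for the conv/activation transform inside the block
    return [0.1 * v for v in x]

def residual_block(x):
    # F(x) + x elementwise: the residual (skip) connection
    return [fx + xi for fx, xi in zip(F(x), x)]

print(residual_block([1.0, 2.0]))
```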

Inception

The size of the convolution operator is specified by hand, which is very difficult: should it be 1×1, 3×3, or 5×5? Inception's main innovation is to take all of them instead of answering the multiple-choice question: the differently sized operators run in parallel, so the choice of operator size is also handed over to the model for training!
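A sketch of the parallel-branch idea: each branch produces an H × W × C feature map, and the maps are concatenated along the channel axis. The branches here are stand-ins, not real convolutions, and the channel counts are hypothetical:

```python
H, W = 4, 4

def branch(channels, value):
    # hypothetical branch output: an H x W x channels feature map
    return [[[value] * channels for _ in range(W)] for _ in range(H)]

b1 = branch(16, 1.0)   # e.g. the 1x1 conv branch
b3 = branch(32, 2.0)   # e.g. the 3x3 conv branch
b5 = branch(8, 3.0)    # e.g. the 5x5 conv branch

# concatenate the channels at every spatial position
out = [[b1[i][j] + b3[i][j] + b5[i][j] for j in range(W)] for i in range(H)]
print(len(out), len(out[0]), len(out[0][0]))  # 4 4 56
```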


Origin blog.csdn.net/weixin_44092088/article/details/112990186