Neurons
In simple terms, given weights w1 and w2, a bias b, and so on, a neuron takes input data x1, x2, etc., computes the linear combination z = w1*x1 + w2*x2 + b, applies the nonlinear activation σ to get a = σ(z), and finally evaluates the loss function L(a, y).
Three common choices for the activation σ are:
Sigmoid
The value range of f(x) is (0, 1)
Tanh
The value range is (-1, 1); unlike Sigmoid, it is zero-centered and also takes negative values.
ReLU
The value range is [0, +∞)
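As a small sketch (plain Python; the function names are mine, not from the article), the three activations can be written as:

```python
import math

def sigmoid(x):
    # squashes any real x into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # squashes any real x into (-1, 1); zero-centered, unlike sigmoid
    return math.tanh(x)

def relu(x):
    # clips negative inputs to 0; range is [0, +inf)
    return max(0.0, x)

print(sigmoid(0.0))  # 0.5
print(relu(-3.0))    # 0.0
```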
Loss function
The loss function measures how far the prediction a is from the label y; we want this loss to be as small as possible.
L1
L2
Cross entropy
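To make the three losses concrete, here is a minimal sketch (function names are mine; cross-entropy is shown in its binary form, where a is a predicted probability and y is the 0/1 label):

```python
import math

def l1_loss(a, y):
    # L1: absolute error
    return abs(a - y)

def l2_loss(a, y):
    # L2: squared error
    return (a - y) ** 2

def cross_entropy(a, y):
    # binary cross-entropy: heavily penalizes confident wrong predictions
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

print(l1_loss(0.8, 1.0))
print(l2_loss(0.5, 1.0))
print(cross_entropy(0.9, 1))
```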
Gradient descent
Gradient descent uses the derivative of the loss function L with respect to the weights W. Because we want the loss to be as small as possible, and the gradient points in the direction in which L increases fastest, we update W by subtracting the gradient (scaled by a learning rate): W ← W − η·∂L/∂W. Moving against the gradient is what decreases the loss. (This explains the seemingly awkward minus sign in gradient descent.)
Simply put, walk against the direction of the gradient. A function graph makes this intuitive; taking a two-dimensional function as an example, you just walk downhill toward the bottom of the valley.
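The "walk downhill" idea can be shown on a one-dimensional loss, say L(w) = (w − 3)² (my toy example, not from the article): repeatedly subtracting the gradient drags w to the minimum at w = 3.

```python
def descend(w=0.0, lr=0.1, steps=100):
    for _ in range(steps):
        grad = 2 * (w - 3)   # dL/dw for L(w) = (w - 3)^2
        w -= lr * grad       # minus sign: step against the gradient, downhill
    return w

print(descend())  # converges very close to 3.0
```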
Backpropagation
Backpropagation applies the chain rule to compute the gradient of the loss with respect to every weight; after a large number of such calculations and updates, an approximately optimal solution is obtained (in general a local rather than global optimum).
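For a single neuron (the z → a → L chain from the start of the article), backpropagation is just the chain rule applied step by step. A hedged sketch with my own variable names, using sigmoid and the L2 loss:

```python
import math

def forward_backward(w, b, x, y):
    # forward pass: z -> a -> L
    z = w * x + b
    a = 1.0 / (1.0 + math.exp(-z))   # sigmoid activation
    L = (a - y) ** 2                 # L2 loss
    # backward pass: chain rule, outermost factor first
    dL_da = 2 * (a - y)
    da_dz = a * (1 - a)              # derivative of sigmoid
    dL_dw = dL_da * da_dz * x        # dz/dw = x
    dL_db = dL_da * da_dz            # dz/db = 1
    return L, dL_dw, dL_db
```

A standard sanity check for such code is to compare the analytic gradient against a finite difference of L.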
Convolutional Neural Network
Convolutional layer: edge detection
Input data
Convolution operator
The values of the convolution operator (also called a kernel or filter) are adjusted continually during training; each value is equivalent to a neuron, and the value is its weight W.
Convolution operation
The convolution operator is a small matrix. Place the operator on top of the input data and use each value of the operator as the weight of the input value at the same position, computing a weighted sum. For the top-left position this gives 1×1 + 2×0 + 0×0 + 0×1 = 1. Store the result 1 in another matrix, then slide the operator to the next position and continue the calculation until all of the input data has been traversed.
Here the stride, i.e. the distance the convolution operator moves at each step, defaults to 1, and the final convolution result is a 3 × 3 matrix.
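The sliding-window computation above can be sketched in a few lines of numpy. The 4 × 4 input matrix is my own made-up example; the operator [[1, 0], [0, 1]] and its top-left result of 1 match the worked calculation above:

```python
import numpy as np

def conv2d(x, k, stride=1):
    # slide operator k over input x, taking a weighted sum at each position
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * k)
    return out

x = np.array([[1., 2., 0., 1.],
              [0., 0., 1., 2.],
              [1., 1., 0., 0.],
              [2., 0., 1., 1.]])
k = np.array([[1., 0.],
              [0., 1.]])
print(conv2d(x, k))                  # 3x3 result; top-left is 1*1 + 2*0 + 0*0 + 0*1 = 1
print(conv2d(x, k, stride=2).shape)  # (2, 2): stride 2 halves the output size
```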
Advanced
There can also be many, many such convolution operators, stacked like this:
The resulting convolution output then also has many, many layers:
Each layer of operators does different work; I understand them as detecting different features of the picture.
After such a convolution operation, an input of size M × N × 3 becomes very deep: the last dimension may grow to 32, 64, or the like (one output layer per operator).
Note that the input data itself can also have many layers. Once you understand how a single layer of data is transformed, multi-layer input is just the per-layer results superimposed.
Convolution stride and padding
Step size
The stride is the distance the convolution operator moves each time. If stride = 2, the movement looks like this:
- First step:
- Second step:
- Third step:
Finally we get a 2 × 2 matrix.
padding
To put it simply, without padding the final convolution result matrix comes out too small, so a ring of values (usually zeros) is filled in around the input. Assuming padding = 1, one ring is added around it:
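With numpy, padding = 1 is a single call to np.pad (the 2 × 2 matrix here is my own example):

```python
import numpy as np

r = np.array([[1, 2],
              [3, 4]])
padded = np.pad(r, pad_width=1, mode='constant', constant_values=0)
print(padded)
# [[0 0 0 0]
#  [0 1 2 0]
#  [0 3 4 0]
#  [0 0 0 0]]
```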
Convolution result size
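The usual formula (a standard result, not spelled out in the article) is output side = ⌊(n + 2p − k) / s⌋ + 1, for input side n, operator side k, padding p, and stride s:

```python
def conv_output_size(n, k, p=0, s=1):
    # floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

print(conv_output_size(4, 2))        # 3: the stride-1 example above
print(conv_output_size(4, 2, s=2))   # 2: the stride-2 example above
print(conv_output_size(4, 2, p=1))   # 5: padding = 1 enlarges the result
```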
Pooling layer
The pooling operator has the same sliding structure as the convolution operator, but there are only two common types: taking the maximum value (max pooling) and taking the average (average pooling).
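A minimal sketch of both pooling types (the input matrix and function names are mine):

```python
import numpy as np

def pool2d(x, size=2, mode='max'):
    # non-overlapping size x size windows; take the max or the mean of each
    oh, ow = x.shape[0] // size, x.shape[1] // size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = patch.max() if mode == 'max' else patch.mean()
    return out

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 1.],
              [0., 0., 5., 6.],
              [1., 2., 7., 8.]])
print(pool2d(x, mode='max'))   # each 2x2 block collapses to its maximum
print(pool2d(x, mode='avg'))   # ... or to its mean
```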
Fully connected layer
It mainly does the work of the classifier; common choices include SVM, FCN, and global pooling. The end of the model connects to the output.
The output is the predicted category: the model reports the most likely class, for example "cat" with a probability of 88%.
Classical convolutional neural network structure
AlexNet
It made the name of convolutional neural networks famous in one shot.
With more and more hidden layers, classification accuracy drops instead of improving. This phenomenon is mostly due to vanishing or exploding gradients. For example, if a weight-related factor is 2, after 10 layers it compounds to 2^10 = 1024 (gradient explosion); if another factor is small, say 0.5, after ten layers it becomes 0.5^10 ≈ 0.00098 (gradient vanishing).
An inappropriate activation function or overly large initial weights can also cause gradients to vanish.
As a reminder, the reasons for the two are not consistent.
ResNet
It is a residual neural network, a model created specifically to solve the vanishing- and exploding-gradient problems described above.
The main innovation is to superimpose the original input onto the transformed output and pass the sum onward, which counteracts vanishing gradients well. Networks can then be made very, very deep.
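The superposition (the "skip connection") can be sketched as follows. This is a toy dense version with my own names; real ResNet blocks use convolutions and batch normalization:

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def residual_block(x, w1, w2):
    f = relu(x @ w1) @ w2    # F(x): the learned transformation
    return relu(f + x)       # add the original input back before the final ReLU

x = np.ones(4)
w1 = np.zeros((4, 4))        # extreme case: the transformation learns nothing,
w2 = np.zeros((4, 4))        # so F(x) = 0 ...
print(residual_block(x, w1, w2))  # ... yet the input still flows through unchanged
```

Even when F(x) contributes nothing, the identity path keeps the signal (and its gradient) alive, which is exactly why very deep stacks of such blocks remain trainable.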
Inception
The size of the convolution operator is specified by hand, which is very difficult: should it be 1×1, 3×3, or 5×5? "I want all of them; I refuse to treat it as a multiple-choice question." That is the main innovation of this model:
the choice of operator size is also handed over to the model for training!
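A rough sketch of the idea (my own stand-in code: each "branch" here is just an averaging filter of the given size, padded so every branch keeps the input's spatial shape; the real Inception module uses learned convolutions and concatenates branch outputs along the channel axis, letting training decide how much each kernel size matters):

```python
import numpy as np

def branch(x, k):
    # stand-in for a k x k conv branch: a padded averaging filter,
    # so the output has the same spatial shape as the input
    p = k // 2
    xp = np.pad(x, p)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = xp[i:i+k, j:j+k].mean()
    return out

def inception_like(x):
    # run 1x1, 3x3 and 5x5 branches in parallel and stack the results
    return np.stack([branch(x, 1), branch(x, 3), branch(x, 5)], axis=0)

x = np.arange(16, dtype=float).reshape(4, 4)
print(inception_like(x).shape)   # (3, 4, 4): three kernel sizes, same spatial size
```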