Notes for Chapter 2 of "Deep Learning with Graphics" (Xiao Xiaobai's own use)

2.2 MP model

The MP (McCulloch-Pitts) model maps multiple inputs to a single output and can realize simple logical operations. Its structure is shown in the following figure:
[figure omitted: MP model structure]

The disadvantage of this model is that its parameters must be determined manually in advance rather than learned.
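For my own reference, a minimal sketch of an MP neuron (the weights and threshold below are hand-chosen for illustration, not taken from the book) realizing logical AND and OR:

```python
# A minimal MP (McCulloch-Pitts) neuron: weighted sum compared to a fixed threshold.
def mp_neuron(inputs, weights, threshold):
    """Fire (output 1) if the weighted sum of the inputs reaches the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0

# AND: both inputs must be 1 for the weighted sum to reach 2.
# OR: a single active input already reaches the threshold of 1.
for x1 in (0, 1):
    for x2 in (0, 1):
        and_out = mp_neuron((x1, x2), weights=(1, 1), threshold=2)
        or_out = mp_neuron((x1, x2), weights=(1, 1), threshold=1)
        print(f"x=({x1},{x2})  AND={and_out}  OR={or_out}")
```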

2.3 Perceptrons

The advantage of the perceptron is that it can automatically determine the parameters through training.

The parameters are adjusted according to the difference between the actual output and the expected output; this is called error-correction learning. Expressed as a formula:
[formula omitted: error-correction learning rule]

The disadvantage of the perceptron is that it can only solve linearly separable problems, but not linearly inseparable problems.
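A small sketch of this, assuming the formula above is the standard error-correction rule w ← w + η(r − y)x (my assumption; the notation is mine). It learns AND, which is linearly separable, but cannot learn XOR, which is not:

```python
import numpy as np

def train_perceptron(X, targets, lr=0.1, epochs=50):
    """Error-correction learning: w <- w + lr * (target - output) * x.
    The bias is folded in as an extra weight on a constant input of 1."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1] + 1)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])    # append bias input
    for _ in range(epochs):
        for x, r in zip(Xb, targets):
            y = 1 if w @ x >= 0 else 0               # step activation
            w += lr * (r - y) * x                    # update only when y != r
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ w >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
and_t = np.array([0, 0, 0, 1])   # linearly separable  -> learnable
xor_t = np.array([0, 1, 1, 0])   # linearly inseparable -> not learnable

print("AND:", predict(train_perceptron(X, and_t), X))   # matches and_t
print("XOR:", predict(train_perceptron(X, xor_t), X))   # cannot match xor_t
```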

2.4 Multilayer Perceptron

To handle linearly inseparable problems, the multilayer perceptron was introduced. Its structure is as follows:
[figure omitted: multilayer perceptron structure]

Error-correction learning can determine the connection weights between two adjacent layers, but it cannot adjust weights across layers. Early multilayer perceptrons could therefore only correct the weights between the middle (hidden) layer and the output layer, while the weights between the input layer and the middle layer had to be fixed random numbers. The problem with this is that different inputs may produce the same output, so the network cannot classify accurately.
So how should a multilayer perceptron train all of its connection weights? The error backpropagation algorithm was later proposed to answer this.

2.5 Error Backpropagation Algorithm

The error backpropagation algorithm obtains an error signal by comparing the actual output with the expected output, propagates this error signal backward from the output layer toward the input layer, layer by layer, to obtain the error signal of each layer, and then adjusts the connection weights of each layer to reduce the error. The method used to adjust the weights is the gradient descent algorithm.
[figure omitted: error backpropagation]
The weight adjustment process of a multilayer perceptron with only one output unit is as follows:
[figures omitted: weight adjustment derivation for a single output unit]

Parameter adjustment:
[formula omitted: parameter update]
The weight adjustment process of a multi-layer perceptron with multiple output units is as follows:
[figure omitted: weight adjustment derivation for multiple output units]
Parameter adjustment:
[formula omitted: parameter update]
The difference between a single output unit and multiple output units is that the adjustment of a weight between the input layer and the middle layer is the sum of the weight-adjustment terms contributed by the related units between the middle layer and the output layer.
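As a sketch for myself (not the book's derivation; the names W1, W2 and the choice of sigmoid activations with a least-squares error are my own assumptions), backpropagation with gradient descent for one hidden layer and several output units. Note how the hidden-layer error sums over all output units, matching the point above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy network: 2 inputs -> 3 hidden units -> 2 output units.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(2, 3))   # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(3, 2))   # hidden -> output weights
lr = 0.5

x = np.array([0.5, -1.0])                 # one training example
t = np.array([1.0, 0.0])                  # its expected output

for step in range(1000):
    # Forward pass.
    h = sigmoid(x @ W1)                   # hidden activations
    y = sigmoid(h @ W2)                   # output activations

    # Backward pass (least-squares error, sigmoid derivative y*(1-y)).
    delta_out = (y - t) * y * (1 - y)                 # error signal per output unit
    delta_hid = (delta_out @ W2.T) * h * (1 - h)      # hidden error: sum over output units

    # Gradient descent updates.
    W2 -= lr * np.outer(h, delta_out)
    W1 -= lr * np.outer(x, delta_hid)

print("output after training:", y)        # close to the target [1, 0]
```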

However, the derivative of the activation function may be 0 (or very close to it). In that case the gradient vanishes and the weights can no longer be adjusted. To deal with this problem, the learning rate needs to be tuned during training to prevent the gradient from vanishing.

When the number of layers is large, vanishing gradients and exploding gradients may occur.

2.6 Error function and activation function

Commonly used error functions (loss functions) include the following.

The cross-entropy cost function for multi-class classification:
[formula omitted: multi-class cross-entropy]

For binary classification:
[formula omitted: binary cross-entropy]

For regression problems, the least-squares error function is used:
[formula omitted: least-squares error]
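Since the formulas themselves are not reproduced here, a small sketch of the standard textbook forms of these three error functions (my own notation, assuming one-hot labels t and network outputs y):

```python
import numpy as np

def cross_entropy_multiclass(y, t, eps=1e-12):
    """E = -sum_k t_k * log(y_k), with t one-hot and y a probability vector."""
    return -np.sum(t * np.log(y + eps))

def cross_entropy_binary(y, t, eps=1e-12):
    """E = -(t*log(y) + (1-t)*log(1-y)) for a single probability output y."""
    return -(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))

def least_squares(y, t):
    """E = 1/2 * sum_k (y_k - t_k)^2, commonly used for regression."""
    return 0.5 * np.sum((y - t) ** 2)

y = np.array([0.7, 0.2, 0.1])      # predicted class probabilities
t = np.array([1.0, 0.0, 0.0])      # one-hot true label
print(cross_entropy_multiclass(y, t))   # ~0.357
print(cross_entropy_binary(0.9, 1))     # ~0.105
print(least_squares(y, t))              # 0.07
```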

In general, the activation functions are:

the sigmoid function, the tanh function, the ReLU function, etc.
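A quick sketch of these three activation functions for my own reference:

```python
import numpy as np

def sigmoid(z):
    """Squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes any real value into (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Passes positive values through and zeroes out negatives."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # [0.119 0.5   0.881]
print(tanh(z))      # [-0.964 0.     0.964]
print(relu(z))      # [0. 0. 2.]
```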

2.7 Likelihood function

The most common likelihood function is the softmax function. It addresses the following two problems: first, because the range of the output layer's values is not fixed, it is hard to judge intuitively what those values mean; second, since the true labels are discrete values, the error between these discrete labels and outputs lying in an uncertain range is hard to measure.
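A minimal softmax sketch (subtracting the maximum is a standard numerical-stability trick, not something the notes mention):

```python
import numpy as np

def softmax(z):
    """Map arbitrary real scores to probabilities that sum to 1."""
    z = z - np.max(z)            # shift for numerical stability; result unchanged
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

scores = np.array([2.0, 1.0, -1.0])   # raw output-layer values, range uncertain
probs = softmax(scores)
print(probs)            # [0.705 0.259 0.035] -> directly comparable to a one-hot label
print(probs.sum())      # 1.0
```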

2.8 Stochastic gradient descent method

There are several ways to feed training samples into error backpropagation. The first is the batch learning algorithm: each iteration traverses all training samples. This algorithm can effectively suppress the influence of noise in the training set, but training takes longer.

The second is the online learning algorithm, which feeds in training samples one at a time. Because each update uses only a single sample, the iteration results may fluctuate greatly, and training may fail to converge.

The third is the mini-batch stochastic gradient descent algorithm, which divides the training set into several subsets and uses one subset per iteration; after all subsets have been used, it starts again from the first subset and continues adjusting the weights. Because each iteration uses only a small number of samples, training time is shorter than with batch learning; because each iteration still uses multiple samples, the fluctuation of the iteration results is smaller than with online learning.
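A schematic loop showing how the three variants differ (the linear-regression example and its gradient are my own placeholder, just to make the sketch runnable):

```python
import numpy as np

def sgd(X, T, grad_fn, w, lr=0.1, epochs=10, batch_size=None):
    """Gradient descent over (X, T).
    batch_size=None -> batch learning (all samples per update)
    batch_size=1    -> online learning (one sample per update)
    batch_size=k    -> mini-batch learning (k samples per update)"""
    n = X.shape[0]
    if batch_size is None:
        batch_size = n
    for _ in range(epochs):
        order = np.random.permutation(n)          # shuffle, then walk through the subsets
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w -= lr * grad_fn(w, X[idx], T[idx])  # one weight update per subset
    return w

# Example problem: least-squares linear regression, gradient X^T (Xw - t) / n.
def linreg_grad(w, X, t):
    return X.T @ (X @ w - t) / len(t)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
T = X @ true_w
w = sgd(X, T, linreg_grad, w=np.zeros(3), lr=0.1, epochs=200, batch_size=10)
print(w)   # approaches [1.0, -2.0, 0.5]
```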

2.9 Learning Rate

The learning rate is a coefficient that controls how much the connection weights are adjusted at each step. A larger learning rate means a larger step, which can reduce the time needed to converge, but a learning rate that is too large may prevent convergence, while one that is too small makes convergence very slow.

In addition, there are methods that adaptively adjust the learning rate during training.
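A toy illustration of the step-size trade-off (my own example, minimizing f(w) = w² whose gradient is 2w):

```python
# Gradient descent on f(w) = w^2, starting from w = 1, with different learning rates.
def run(lr, steps=20):
    w = 1.0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(run(0.01))   # ~0.67 - too small: still far from the minimum at 0
print(run(0.4))    # ~0.0  - reasonable: converges quickly
print(run(1.1))    # ~38   - too large: overshoots, |w| grows every step and diverges
```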


Origin blog.csdn.net/qq_49785839/article/details/115434133