Andrew Ng Deep Learning Course 2, Week 1: Knowledge Summary (1)

For my own reference only

1.1 Training, Validation, Testing

1.2 Bias, variance

High bias (underfitting): e.g., training set error 50%, validation set error 50%.
High variance (overfitting): e.g., training set error 1%, validation set error 50%.
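As a rough way to read these two numbers, here is a minimal sketch (my own illustration, not from the course), assuming a 5% threshold and a baseline/Bayes error of 0:

```python
# Rough bias/variance diagnosis from train/validation error rates (in [0, 1]).
def diagnose(train_err, val_err, baseline_err=0.0, threshold=0.05):
    high_bias = (train_err - baseline_err) > threshold   # far from the achievable error
    high_variance = (val_err - train_err) > threshold    # large train/validation gap
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias (underfitting)"
    if high_variance:
        return "high variance (overfitting)"
    return "looks fine"

print(diagnose(0.50, 0.50))  # -> high bias (underfitting)
print(diagnose(0.01, 0.50))  # -> high variance (overfitting)
```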

1.3 Basics of machine learning

Methods for training neural networks (the "basic recipe" flowchart from the course):

  • High bias (poor fit even on the training set)? Try a bigger network, train longer, or change the architecture.
  • High variance (poor generalization to the dev set)? Get more data, add regularization, or change the architecture.

1.4 Regularization - L2 regularization

L2 regularization adds a penalty term to the cost function, where λ (lambda) is the regularization parameter; overfitting is reduced by tuning λ.
Regularization in neural networks: the penalty is applied to the Frobenius norm of every layer's weight matrix W^[l].
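The regularized cost functions, reconstructed here in the course's usual notation (m training examples, L layers):

```latex
% Logistic regression with L2 regularization:
J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)
          + \frac{\lambda}{2m}\lVert w \rVert_2^2

% L-layer neural network (Frobenius norm of each weight matrix):
J\big(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}\big)
  = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)
  + \frac{\lambda}{2m}\sum_{l=1}^{L}\lVert W^{[l]} \rVert_F^2
```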
How to do gradient descent with this norm: the penalty contributes an extra (λ/m)·W^[l] to each gradient dW^[l], so every update shrinks the weights by a factor of (1 - αλ/m) before the usual step. The L2 penalty is therefore equivalent to weight decay.
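A minimal NumPy sketch of that update (my own illustration; dW_backprop stands for the gradient of the unregularized cost computed by backpropagation):

```python
import numpy as np

def l2_update(W, dW_backprop, alpha, lam, m):
    """One gradient-descent step with L2 regularization (weight decay).

    W           : weight matrix of one layer
    dW_backprop : gradient of the unregularized cost w.r.t. W
    alpha       : learning rate, lam: lambda, m: number of training examples
    """
    dW = dW_backprop + (lam / m) * W   # gradient of the (lam/2m)*||W||_F^2 term added in
    return W - alpha * dW              # same as W*(1 - alpha*lam/m) - alpha*dW_backprop
```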

1.5 Why regularization prevents overfitting

You don't really need the full explanation; watch the video if you want the details: https://www.bilibili.com/video/BV1V441127zE?p=5

Intuition: if λ is large enough, the weights w are driven toward very small values. When many weights are close to zero, much of the deep network's complex structure is effectively switched off, leaving a much simpler network, which prevents overfitting.

1.6 Regularization - dropout regularization

This was covered before: dropout randomly deactivates some neurons during training.
Note that after dropping units, the remaining activations should be divided by the keep probability (inverted dropout). For example, if 80% of the units are kept, divide w·x by 0.8 so that the expected value stays the same despite the random dropping.
Note: dropout is not used at test time. At test time we want a prediction that is as accurate as possible given the trained weights, not a randomized output.
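A minimal NumPy sketch of inverted dropout for one layer (my own illustration; a is the layer's activation matrix, keep_prob the probability of keeping a unit):

```python
import numpy as np

def inverted_dropout(a, keep_prob=0.8, training=True):
    """Apply inverted dropout to activations `a` during training only."""
    if not training or keep_prob >= 1.0:
        return a                                  # test time: pass through unchanged
    mask = np.random.rand(*a.shape) < keep_prob   # keep each unit with prob keep_prob
    return (a * mask) / keep_prob                 # rescale so the expected value is unchanged
```

Setting keep_prob = 1.0 switches dropout off, which is also how the cost J can be monitored before re-enabling it (see 1.7).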

1.7 Why dropout works

Similar to L2 regularization.
Two ways to apply dropout:
(1) Give layers that are more prone to overfitting a lower keep probability (a higher drop rate). Disadvantage: more hyperparameters to search over with cross-validation.
(2) Use dropout only on some layers and not on others.
Disadvantage of dropout: the cost function J is no longer well defined, so it is hard to check that the loss is actually converging. => Turn dropout off first (keep_prob = 1), confirm that the loss decreases, and then turn it back on.

1.8 Other regularization methods

  • Data augmentation: create extra training examples by transforming existing ones (e.g., flipping or randomly cropping images).
  • Early stopping: stop training when the dev-set error stops improving, rather than training to full convergence (a rough sketch follows this list).
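A minimal early-stopping loop (my own sketch; the three callables are hypothetical stand-ins for your own training code):

```python
def train_with_early_stopping(train_one_epoch, dev_error, get_weights,
                              max_epochs=100, patience=5):
    """Stop when the dev-set error has not improved for `patience` epochs.

    train_one_epoch(): runs one pass over the training set
    dev_error():       returns the current dev (validation) set error
    get_weights():     returns a snapshot of the current model weights
    """
    best_err, best_weights, bad_epochs = float("inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch()
        err = dev_error()
        if err < best_err:
            best_err, best_weights, bad_epochs = err, get_weights(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:        # dev error stopped improving
                break
    return best_weights                       # weights from the best dev-set epoch
```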

1.9 Normalized input

(1) Zero mean
(2) Normalized variance
=> x' = (x - μ) / σ (the same standardization used for a normal distribution)
After this, x' has mean 0 and variance 1.
Reason for normalizing: the contours of the cost function become more symmetric, so gradient descent can reach the minimum faster.
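A minimal NumPy sketch (my own illustration). Note that the μ and σ computed on the training set should also be used to normalize the dev/test sets:

```python
import numpy as np

def fit_normalizer(X_train):
    """Per-feature mean and standard deviation of the training set (shape (m, n))."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8   # small epsilon avoids division by zero
    return mu, sigma

def normalize(X, mu, sigma):
    """Apply x' = (x - mu) / sigma using the training-set statistics."""
    return (X - mu) / sigma

# Usage: dev/test data are normalized with the *training* statistics.
# mu, sigma = fit_normalizer(X_train)
# X_train_n, X_dev_n = normalize(X_train, mu, sigma), normalize(X_dev, mu, sigma)
```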

1.10 Exploding and Vanishing Gradients

In a very deep network, the forward signal and the gradients are products of many weight terms. If the weights are all somewhat greater than 1, the repeated multiplication grows exponentially with depth and the result blows up -> exploding gradients. If the weights are all somewhat less than 1, the product shrinks exponentially toward 0 -> vanishing gradients.
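A quick numeric illustration (my own, not from the course) of how a per-layer factor compounds over 50 layers:

```python
# A factor slightly above or below 1, multiplied over many layers, quickly
# explodes or vanishes.
layers = 50
print(1.5 ** layers)   # ~ 6.4e8   -> exploding
print(0.5 ** layers)   # ~ 8.9e-16 -> vanishing
```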

1.11 Weight initialization - one way to mitigate exploding and vanishing gradients

Initialize the weights of layer l with a variance scaled by the fan-in:
W^[l] = np.random.randn(shape) * sqrt(1 / n^[l-1]),
where n^[l-1] is the number of units in layer l-1. (My understanding: each unit in layer l sums n^[l-1] incoming terms w·x, so scaling by 1/n^[l-1] keeps z at a reasonable size.) This is the variance recommended for the tanh activation (Xavier initialization).
If you use the ReLU activation, the video recommends a variance of 2/n^[l-1] instead, i.e. multiply by sqrt(2 / n^[l-1]) (He initialization).
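A minimal NumPy sketch of these two choices (my own illustration; n_in is the number of units in the previous layer, n_out in the current layer):

```python
import numpy as np

def init_weights(n_in, n_out, activation="relu"):
    """Initialize an (n_out, n_in) weight matrix with fan-in-scaled variance."""
    if activation == "relu":
        scale = np.sqrt(2.0 / n_in)   # He initialization: variance 2/n^[l-1]
    else:
        scale = np.sqrt(1.0 / n_in)   # Xavier-style: variance 1/n^[l-1] (e.g. tanh)
    W = np.random.randn(n_out, n_in) * scale
    b = np.zeros((n_out, 1))          # biases can simply start at zero
    return W, b
```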
That said, compared with other optimization techniques, the exact initialization scheme is not that critical.

1.12 Gradient

Two-sided (central) difference approximations of the gradient are more accurate than one-sided ones.
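The two difference formulas, reconstructed here in the course's ε notation:

```latex
% One-sided difference, approximation error O(\varepsilon):
f'(\theta) \approx \frac{f(\theta + \varepsilon) - f(\theta)}{\varepsilon}

% Two-sided (central) difference, approximation error O(\varepsilon^2):
f'(\theta) \approx \frac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon}
```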

1.13 Gradient checking - used to verify that backpropagation is correct

Compute the 2-norm of the difference between the numerically approximated gradient dθ_approx and the backpropagation gradient dθ, then divide by the sum of their 2-norms: ||dθ_approx - dθ||_2 / (||dθ_approx||_2 + ||dθ||_2). If this ratio is small (on the order of 1e-7), backpropagation is fine.
Note:
(1) Only run the check while debugging, not during training, because it is too computationally expensive.
(2) If the check fails, inspect the individual components of the gradient to see which ones differ the most; that points to where the bug is.
(3) Don't forget regularization: the regularization term must be included in both the cost and the gradients.
(4) Gradient checking cannot be used together with dropout (the randomness makes the gradient of J ill-defined); set keep_prob = 1 while checking.
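A minimal NumPy sketch of the check itself (my own illustration; grad_approx and grad_backprop are the two gradients flattened into vectors):

```python
import numpy as np

def gradient_check(grad_approx, grad_backprop, tol=1e-7):
    """Relative difference between the numerical and backprop gradients."""
    num = np.linalg.norm(grad_approx - grad_backprop)
    den = np.linalg.norm(grad_approx) + np.linalg.norm(grad_backprop) + 1e-12
    diff = num / den
    if diff < tol:
        print(f"backprop looks correct (difference = {diff:.2e})")
    else:
        print(f"possible bug in backprop (difference = {diff:.2e})")
    return diff
```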

Origin: blog.csdn.net/yeeanna/article/details/117924891