Andrew Ng's Deep Learning Course 2 Knowledge Summary (1)
Personal notes, for my own reference only
1.1 Training, Validation, Testing
1.2 Bias, Variance
- High bias (underfitting): e.g., training set error 50%, validation set error 50%
- High variance (overfitting): e.g., training set error 1%, validation set error 50%
1.3 Basics of machine learning
Basic recipe for training neural networks: if the model has high bias, try a bigger network or longer training; if it has high variance, try more data or regularization.
1.4 Regularization - L2 regularization
L2 regularization adds the squared L2 norm of the weights to the cost:
J(w, b) = (1/m) Σ_i L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾) + (λ/2m) ‖w‖₂²
where λ (lambda) is the regularization parameter; overfitting is controlled by tuning λ.
Regularization in neural networks: the penalty sums the Frobenius norm of every layer's weight matrix, (λ/2m) Σ_l ‖W^[l]‖²_F.
How gradient descent uses this norm: backprop adds (λ/m)W^[l] to each dW^[l], so the update becomes
W^[l] := W^[l] − α(dW_backprop + (λ/m)W^[l]) = (1 − αλ/m)W^[l] − α·dW_backprop
Since W^[l] is multiplied by a factor (1 − αλ/m) < 1 at every step, L2 regularization is equivalent to weight decay.
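The weight-decay equivalence can be checked numerically; a minimal NumPy sketch (all values here are made up):

```python
import numpy as np

np.random.seed(0)
W = np.random.randn(3, 3)            # a layer's weight matrix
dW_backprop = np.random.randn(3, 3)  # stand-in for the gradient from backprop
alpha, lam, m = 0.1, 0.7, 100        # learning rate, lambda, batch size

# L2-regularized update: the penalty adds (lam/m)*W to the gradient
W_l2 = W - alpha * (dW_backprop + (lam / m) * W)

# Weight-decay form: shrink W by (1 - alpha*lam/m), then take a plain step
W_decay = (1 - alpha * lam / m) * W - alpha * dW_backprop

assert np.allclose(W_l2, W_decay)  # the two updates coincide
```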
1.5 Why regularization can prevent overfitting?
Knowing the reason isn't essential; watch the video if interested: https://www.bilibili.com/video/BV1V441127zE?p=5
If λ is large enough, w is driven very small. As w approaches zero, many hidden units contribute almost nothing, so the complex deep network effectively collapses into a much simpler one, which prevents overfitting.
1.6 Regularization - dropout regularization
Covered before: dropout randomly deactivates some neurons during training.
Note: after dropping units, the remaining activations should be divided by the keep probability (inverted dropout). For example, if 80% of the neurons are kept, use a·mask / 0.8 so that the expected value of the activations is unchanged.
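As a sketch, inverted dropout on one layer's activations might look like this (shapes and keep_prob are illustrative):

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8
a = np.random.rand(4, 5)                     # activations of some layer
mask = np.random.rand(*a.shape) < keep_prob  # keep each unit with prob 0.8
a_drop = a * mask / keep_prob                # divide by keep_prob so E[a] is unchanged
```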
Note: dropout is not used in the test phase, because at test time we want a deterministic, accurate output from the trained weights, not a randomized one.
1.7 Why dropout works
Similar to L2 regularization.
Two ways to apply dropout:
(1) Give layers that are prone to overfitting a lower keep probability. Disadvantage: more hyperparameters to search over with cross-validation.
(2) Use dropout on some layers and not on others.
Disadvantage of dropout: the cost function J is no longer well defined, so it is hard to check whether the loss converges. => Turn dropout off first, confirm the loss converges, then turn it back on.
1.8 Other regularization methods
- data augmentation
- early stopping: stop training early, once validation error stops improving and before the network overfits
1.9 Normalized input
(1) Zero mean
(2) Normalized variance
=> x' = (x − μ)/σ (the same standardization used for a normal distribution)
x' then has mean 0 and variance 1.
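A minimal sketch of the two steps on synthetic data (the course convention of columns-as-examples is assumed):

```python
import numpy as np

np.random.seed(2)
X = np.random.rand(3, 100) * 10 + 5    # synthetic inputs: 3 features, 100 examples
mu = X.mean(axis=1, keepdims=True)     # (1) zero mean
sigma = X.std(axis=1, keepdims=True)   # (2) normalize variance
X_norm = (X - mu) / sigma              # each feature now has mean 0, variance 1
```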
Reason for normalization: the cost surface becomes rounder and more symmetric, so gradient descent reaches the minimum faster.
1.10 Exploding and Vanishing Gradients
In a very deep network the output is a product of many weight factors. If the weights are all slightly greater than 1, the product grows exponentially with depth and tends to ∞ -> exploding gradients.
If the weights are all slightly less than 1, the product shrinks exponentially and tends to 0 -> vanishing gradients.
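A toy scalar illustration of why depth causes this (50 is an arbitrary depth):

```python
# Repeated multiplication by factors slightly above or below 1
# grows or shrinks exponentially with depth.
depth = 50
print(1.5 ** depth)  # ~6.4e8  -> "explodes"
print(0.5 ** depth)  # ~8.9e-16 -> "vanishes"
```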
1.11 Weight initialization - one way to mitigate exploding and vanishing gradients
Initialize the weights of layer l as:
W^[l] = np.random.randn(shape) * sqrt(1 / n^[l-1])
where n^[l-1] is the number of neurons in layer l−1 (the fan-in). (My understanding: since each unit in layer l sums n^[l-1] weighted inputs, scaling the variance by 1/n^[l-1] keeps z from growing with layer width.) This is the formula for the tanh activation (Xavier initialization).
If you use ReLU activation, the video recommends variance 2/n^[l-1] instead (He initialization): multiply by sqrt(2 / n^[l-1]).
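Both initializations sketched in NumPy (the layer sizes here are made up):

```python
import numpy as np

np.random.seed(3)
n_prev, n_curr = 500, 100  # fan-in n^[l-1] and layer size n^[l]

# Xavier-style for tanh: variance 1/n_prev
W_tanh = np.random.randn(n_curr, n_prev) * np.sqrt(1.0 / n_prev)
# He initialization for ReLU: variance 2/n_prev
W_relu = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)

print(W_tanh.var())  # empirical variance, close to 1/n_prev
print(W_relu.var())  # empirical variance, close to 2/n_prev
```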
but compared to other optimization methods, initialization is not so important.
1.12 Numerical approximation of gradients
The two-sided (centered) difference (f(θ+ε) − f(θ−ε)) / (2ε) is more accurate than the one-sided difference (f(θ+ε) − f(θ)) / ε: its error is O(ε²) versus O(ε).
1.13 Gradient checking - used to test whether backpropagation is correct
Compute the two-norm of the difference between the approximate and analytic gradient vectors, divided by the sum of their two-norms: ‖dθ_approx − dθ‖₂ / (‖dθ_approx‖₂ + ‖dθ‖₂). If the value is small (around 10⁻⁷), the implementation is fine; values near 10⁻³ or larger suggest a bug.
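A sketch of the whole check on a toy function (f and its gradient are illustrative, not from the course):

```python
import numpy as np

def f(theta):
    return np.sum(theta ** 3)

def analytic_grad(theta):          # d/dtheta of sum(theta^3)
    return 3 * theta ** 2

theta = np.array([1.0, -2.0, 0.5])
eps = 1e-7
grad_approx = np.zeros_like(theta)
for i in range(len(theta)):
    plus, minus = theta.copy(), theta.copy()
    plus[i] += eps
    minus[i] -= eps
    grad_approx[i] = (f(plus) - f(minus)) / (2 * eps)  # two-sided difference

grad = analytic_grad(theta)
diff = np.linalg.norm(grad_approx - grad) / (np.linalg.norm(grad_approx) + np.linalg.norm(grad))
print(diff)  # around 1e-7 or smaller -> backprop is likely correct
```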
Note:
(1) Only run the check while debugging, not during training, because it is too computationally expensive.
(2) If the check fails, inspect the gradient component by component to find which dθ[i] differs most; that points to the buggy layer.
(3) Don't forget regularization: the analytic gradient must include the regularization term.
(4) Gradient checking cannot be used together with dropout (dropout makes the cost random, so the gradient can't be checked); turn dropout off first.