Table of contents
- Exercise 7-1 In small-batch gradient descent, try to analyze why the learning rate is proportional to the batch size
- Exercise 7-2 In the Adam algorithm, explain the rationality of the bias correction of the exponential weighted average
- Exercise 7-9 Prove that in standard stochastic gradient descent, weight decay regularization and L_{2} regularization have the same effect. And analyze whether this conclusion still holds in the momentum method and Adam algorithm
- Summary
Exercise 7-1 In small-batch gradient descent, try to analyze why the learning rate is proportional to the batch size
In mini-batch gradient descent, the update is

$\theta_t = \theta_{t-1} - \alpha g_t$

where $g_t = \frac{\delta}{K}$ and $\delta$ denotes the sum of the gradients over the $K$ samples in the batch. Substituting gives:

$\theta_t = \theta_{t-1} - \frac{\alpha}{K}\delta$

For the size of each parameter update to stay well-scaled as the batch size changes, the ratio $\frac{\alpha}{K}$ should remain roughly constant. Therefore the learning rate $\alpha$ should be proportional to the batch size $K$.
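As a sanity check, here is a minimal NumPy sketch of this heuristic; the quadratic objective, the data, and all hyperparameters are illustrative assumptions, not part of the exercise:

```python
import numpy as np

# Toy check of the "alpha proportional to K" heuristic. Per-sample loss is
# 0.5 * (theta - x_i)^2, so the per-sample gradient is (theta - x_i) and
# the minimizer is the data mean (3.0 here).
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1024)

def minibatch_sgd(batch_size, lr, steps=500):
    theta = 0.0
    for _ in range(steps):
        batch = rng.choice(data, size=batch_size)
        delta = np.sum(theta - batch)        # delta: summed batch gradient
        theta -= (lr / batch_size) * delta   # update uses (alpha / K) * delta
    return theta

# Scaling the learning rate by 64 / 8 = 8 keeps alpha / K, and hence the
# per-update step scale, the same for both batch sizes.
small = minibatch_sgd(batch_size=8, lr=0.01)
large = minibatch_sgd(batch_size=64, lr=0.08)
```

Both runs land near the minimizer 3.0, even though the learning rates differ by a factor of 8, because $\frac{\alpha}{K}$ is the same in both.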
Exercise 7-2 In the Adam algorithm, explain the rationality of the bias correction of the exponential weighted average
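In Adam, the moment estimates are initialized to zero: $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ with $m_0 = 0$. Unrolling gives $m_t = (1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i} g_i$, so if the gradients are roughly stationary, $\mathbb{E}[m_t] \approx (1-\beta_1^t)\,\mathbb{E}[g_t]$: in the early iterations the estimate is biased toward zero. Dividing by $1-\beta_1^t$ (and likewise by $1-\beta_2^t$ for the second moment) removes this bias, and as $t$ grows the correction factor tends to 1 and has no effect. A minimal sketch, assuming a constant gradient $g = 1$ so that the bias is exact:

```python
# Adam's first-moment EMA with m_0 = 0 and a constant gradient g = 1.0.
# The raw average satisfies m_t = (1 - beta**t) * g, i.e. it is biased
# toward zero for small t; dividing by (1 - beta**t) removes the bias.
beta = 0.9
g = 1.0
m = 0.0
for t in range(1, 11):
    m = beta * m + (1 - beta) * g
    m_hat = m / (1 - beta ** t)   # bias-corrected estimate
# After 10 steps m is still well below g, but m_hat recovers g exactly.
```

Without the correction, early steps would use a first-moment estimate that is far too small and a second-moment estimate whose (too small) square root divides the step, so the initial update sizes would be badly distorted.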
Exercise 7-9 Prove that in standard stochastic gradient descent, weight decay regularization and L_{2} regularization have the same effect. And analyze whether this conclusion still holds in the momentum method and Adam algorithm
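For standard SGD, the equivalence follows directly from the update rules (writing $g_t$ for the gradient of the unregularized loss):

```latex
% SGD with L2 regularization (penalty term \frac{\lambda}{2}\|\theta\|^2):
\theta_t = \theta_{t-1} - \alpha\,(g_t + \lambda\theta_{t-1})
         = (1 - \alpha\lambda)\,\theta_{t-1} - \alpha g_t
% SGD with weight decay at rate w:
\theta_t = (1 - w)\,\theta_{t-1} - \alpha g_t
% The two updates coincide when w = \alpha\lambda.
```

So in standard SGD, L2 regularization with coefficient $\lambda$ is exactly weight decay with rate $w = \alpha\lambda$.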
Does this conclusion still hold in the momentum method and the Adam algorithm?

In momentum and Adam, the update direction depends on an exponentially weighted average of the recent gradients (and, in Adam, is additionally rescaled by an estimate of the gradient's second moment). When L2 regularization is combined with these adaptive methods, the penalty gradient $\lambda\theta$ is folded into those averages, so weights with larger historical parameter and/or gradient magnitudes are regularized less than they would be under decoupled weight decay. The two schemes are therefore no longer equivalent; this gap is what decoupled weight decay (as in AdamW) addresses.
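A minimal one-parameter Adam sketch illustrates the non-equivalence; the hyperparameters and the constant "data" gradient are illustrative assumptions:

```python
import math

# Compare Adam with L2 regularization (penalty added to the gradient)
# against Adam with decoupled weight decay, on a single parameter with a
# large constant data gradient.
def adam_run(decoupled, theta=1.0, lr=0.1, lam=0.1,
             beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = 10.0                       # large constant data gradient
        if not decoupled:
            g += lam * theta           # L2: penalty enters the moment estimates
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            theta -= lr * lam * theta  # weight decay: applied to theta directly
    return theta

l2 = adam_run(decoupled=False)
wd = adam_run(decoupled=True)
# Adam divides by the gradient magnitude, so the large data gradient drowns
# the L2 penalty: the L2 run ends noticeably farther from zero than the
# decoupled weight-decay run.
```

Because Adam normalizes each step by the second-moment estimate, the L2 penalty's pull toward zero is divided by the (large) gradient magnitude, while decoupled weight decay shrinks the parameter by a fixed fraction regardless of the gradient history; the two runs end at clearly different weights.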