Assignment 12: Chapter 7 after-school questions

Exercise 7-1 In mini-batch gradient descent, analyze why the learning rate should be proportional to the batch size

In mini-batch gradient descent, let $\delta = \sum_{(x,y)\in\mathcal{S}_t} \frac{\partial \mathcal{L}(y, f(x;\theta_{t-1}))}{\partial \theta}$ denote the sum of the per-sample gradients over a batch $\mathcal{S}_t$ of size $K$. The mini-batch gradient is then

$$g_t = \frac{\delta}{K},$$

and the parameter update is

$$\theta_t = \theta_{t-1} - \alpha g_t = \theta_{t-1} - \frac{\alpha}{K}\delta.$$

Each sample therefore contributes a step of size $\frac{\alpha}{K}$ to the update. To keep this per-sample step size constant when the batch size changes, $\frac{\alpha}{K}$ should stay fixed, so the learning rate $\alpha$ should be chosen proportional to the batch size $K$.
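As a quick numerical check (a minimal sketch with illustrative choices of my own: the fixed per-sample step $\eta$, the batch sizes, and the random per-sample gradients are not from the original derivation), the script below verifies that choosing $\alpha = \eta K$ makes the update equal to $\eta\,\delta$, i.e. each sample contributes the same fixed step regardless of the batch size:

```python
import numpy as np

# Illustrative sketch: with alpha proportional to K, the per-sample step
# alpha / K stays constant, so the update alpha * (delta / K) = eta * delta,
# i.e. each sample contributes eta times its own gradient, independent of K.
rng = np.random.default_rng(0)
eta = 0.01                                        # fixed per-sample step size alpha / K

for K in (16, 32, 64):
    per_sample_grads = rng.normal(size=(K, 3))    # one gradient vector per sample
    delta = per_sample_grads.sum(axis=0)          # delta: sum of per-sample gradients
    g_t = delta / K                               # mini-batch gradient g_t = delta / K
    alpha = eta * K                               # learning rate proportional to K
    update = alpha * g_t
    assert np.allclose(update, eta * delta)       # per-sample contribution is unchanged
    print(f"K={K:2d}  alpha={alpha:.2f}  ||update||={np.linalg.norm(update):.4f}")
```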

Exercise 7-2 In the Adam algorithm, explain why the bias correction of the exponentially weighted moving averages is reasonable

Adam maintains exponentially weighted moving averages of the gradient and of its element-wise square,

$$M_t = \beta_1 M_{t-1} + (1-\beta_1)\, g_t, \qquad G_t = \beta_2 G_{t-1} + (1-\beta_2)\, g_t \odot g_t,$$

with $M_0 = G_0 = 0$. Because the averages start from zero, they are biased toward zero during the early iterations. Unrolling the recursion gives $M_t = (1-\beta_1)\sum_{i=1}^{t}\beta_1^{\,t-i} g_i$, so if the gradients have roughly the same expectation, $\mathbb{E}[M_t] \approx (1-\beta_1^{t})\,\mathbb{E}[g_t]$. Dividing by $1-\beta_1^{t}$ (and likewise dividing $G_t$ by $1-\beta_2^{t}$) removes this initialization bias:

$$\hat{M}_t = \frac{M_t}{1-\beta_1^{t}}, \qquad \hat{G}_t = \frac{G_t}{1-\beta_2^{t}}.$$

As $t$ grows, $\beta_1^{t}$ and $\beta_2^{t}$ approach 0 and the correction factors approach 1, so the correction mainly acts in the early stage of training, which is exactly where the raw averages are most severely underestimated.
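A minimal numerical sketch of this effect (the constant gradient stream $g=1.0$ and $\beta_1=0.9$ below are illustrative assumptions, not values from the original answer): the raw average $M_t$ starts near $0.1$ and approaches the true value only slowly, while the corrected $\hat{M}_t = M_t/(1-\beta_1^t)$ matches it from the very first step.

```python
# Illustrative sketch of the bias correction for the first-moment estimate.
beta1 = 0.9
g = 1.0            # pretend the gradient is constant, so the true average is 1.0
M = 0.0            # zero initialization is the source of the bias

for t in range(1, 11):
    M = beta1 * M + (1 - beta1) * g        # exponentially weighted moving average
    M_hat = M / (1 - beta1 ** t)           # bias-corrected estimate
    print(f"t={t:2d}  M_t={M:.4f}  M_hat_t={M_hat:.4f}")

# M_t: 0.1000, 0.1900, 0.2710, ...  (biased toward 0 early on)
# M_hat_t: 1.0000 at every step, matching the true gradient value.
```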

Exercise 7-9 Prove that in standard stochastic gradient descent, weight decay regularization and L2 regularization have the same effect, and analyze whether this conclusion still holds for the momentum method and the Adam algorithm

Proof: With weight decay, every SGD step shrinks the parameters in addition to taking the gradient step,

$$\theta_t = (1-\lambda)\,\theta_{t-1} - \alpha\, g_t .$$

With L2 regularization, the objective becomes $\mathcal{L}(\theta) + \frac{\lambda'}{2}\lVert\theta\rVert^2$, whose gradient is $g_t + \lambda'\theta_{t-1}$, so the SGD update is

$$\theta_t = \theta_{t-1} - \alpha\,(g_t + \lambda'\theta_{t-1}) = (1-\alpha\lambda')\,\theta_{t-1} - \alpha\, g_t .$$

Choosing $\lambda = \alpha\lambda'$ makes the two updates identical, so in standard stochastic gradient descent weight decay and L2 regularization have the same effect.
Does this conclusion still hold for the momentum method and the Adam algorithm?

No. With L2 regularization, the term $\lambda'\theta$ is folded into the gradient, so the update direction depends on the weighted average of the gradients over the recent period, and in Adam it is additionally rescaled by the adaptive per-parameter step size. As a result, when combined with momentum or Adam, parameters with large historical parameter and/or gradient magnitudes are regularized less under L2 regularization than they would be under decoupled weight decay, so the two are no longer equivalent.
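A minimal numerical sketch of both claims, under assumptions of my own (the hyperparameters, the fixed loss gradient, the five update steps, and the AdamW-style form of the decoupled weight decay are all illustrative choices): plain SGD produces identical parameters for the two schemes, while Adam with L2 regularization and Adam with decoupled weight decay end up at different parameters.

```python
import numpy as np

# Illustrative hyperparameters and gradients only.
alpha, lam = 0.1, 0.01                  # learning rate, regularization strength
beta1, beta2, eps = 0.9, 0.999, 1e-8
theta0 = np.array([1.0, -2.0])          # current parameters
g = np.array([0.5, 4.0])                # fixed loss gradient: one small, one large entry

# Standard SGD, one step: identical when the weight-decay factor is alpha * lam.
sgd_l2 = theta0 - alpha * (g + lam * theta0)        # L2 term folded into the gradient
sgd_wd = (1 - alpha * lam) * theta0 - alpha * g     # explicit weight decay
print(np.allclose(sgd_l2, sgd_wd))                  # True

# Adam with L2 regularization vs. Adam with decoupled (AdamW-style) weight decay.
def adam_run(decoupled, steps=5):
    theta, m, v = theta0.copy(), 0.0, 0.0
    for t in range(1, steps + 1):
        grad = g if decoupled else g + lam * theta  # L2 enters the adapted gradient
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
        step = alpha * m_hat / (np.sqrt(v_hat) + eps)
        if decoupled:
            step = step + alpha * lam * theta       # decay is not rescaled by sqrt(v_hat)
        theta = theta - step
    return theta

print(adam_run(decoupled=False))  # Adam + L2
print(adam_run(decoupled=True))   # Adam + decoupled weight decay: different parameters
```

The difference comes from the adaptive denominator: in the L2 case the $\lambda'\theta$ term is divided by $\sqrt{\hat{G}_t}+\epsilon$ along with everything else, which is why parameters with large gradient magnitudes receive less effective regularization than under decoupled weight decay.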

Summary

In mini-batch gradient descent the learning rate should scale with the batch size so that the per-sample step size $\frac{\alpha}{K}$ stays constant; Adam's bias correction compensates for the zero initialization of its moving averages; and weight decay coincides with L2 regularization only for standard SGD, not for the momentum method or Adam.


Source: blog.csdn.net/weixin_51395608/article/details/128276168