[Deep Learning Notes] Momentum Gradient Descent Method

This column contains my study notes for the artificial intelligence course "Neural Networks and Deep Learning" on NetEase Cloud Classroom. The videos are jointly produced by NetEase Cloud Classroom and deeplearning.ai, and the lecturer is Professor Andrew Ng. Readers who are interested can watch the videos on NetEase Cloud Classroom for further study. The link to the course is as follows:

Neural Networks and Deep Learning - NetEase Cloud Classroom

Anyone interested in neural networks and deep learning is also welcome to join the discussion~

Table of contents

1 Exponentially weighted average

2 Momentum Gradient Descent Method


1 Exponentially weighted average

        Before introducing more complex optimization algorithms, you need to understand Exponentially Weighted Average, which is also called Exponentially Weighted Moving Average in statistics.

 

        Here is one year of daily temperature data for London. If you want to see the trend of the temperature over the year, that is, a local average of the temperature, you can take 0.9 times the previous day's average plus 0.1 times the current day's temperature as the new average.

 

 

        If the coefficient 0.9 is replaced by β and 0.1 by 1 - β, we obtain the formula for the exponentially weighted average:

v_t = \beta v_{t-1} + (1-\beta) \theta_t \, , 0 < \beta < 1 

v_t can be interpreted as an average over roughly 1/(1-β) days. For example, if β is 0.9, then 1/(1-β) = 10, so v_t is approximately the average temperature over the previous 10 days.

        The larger the value of β, the smoother the resulting curve, such as the green curve in the figure above (corresponding to β = 0.98). Because the weight given to the previous average is 0.98 and the weight given to the current day's temperature is only 1 - 0.98 = 0.02, the average responds more slowly when the temperature changes.
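
        As a concrete illustration, below is a minimal sketch of this formula in Python/NumPy; the temperature array is synthetic and simply stands in for the London data shown in the lecture.

import numpy as np

def exp_weighted_average(thetas, beta=0.9):
    # v_t = beta * v_{t-1} + (1 - beta) * theta_t
    v = 0.0
    averages = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta  # blend previous average with today's value
        averages.append(v)
    return np.array(averages)

# Synthetic daily temperatures (a sine wave plus noise), not the actual London data
temps = 10 + 8 * np.sin(np.linspace(0, 2 * np.pi, 365)) + np.random.randn(365)
smooth_10 = exp_weighted_average(temps, beta=0.9)    # roughly a 10-day average
smooth_50 = exp_weighted_average(temps, beta=0.98)   # roughly a 50-day average, smoother but lagging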

2 Momentum Gradient Descent Method

 

        Assume that in the figure above the red dot marks the position of the minimum of the cost function. During the iterations of standard gradient descent, the parameters slowly oscillate their way toward the minimum, and these up-and-down fluctuations slow gradient descent down. With a larger learning rate the fluctuations can become even larger, while reducing the learning rate makes the iterations slower.

 

        With Momentum Gradient Descent, all you need to do is compute an exponentially weighted average of the gradients and then update the weights with that value.
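
        Written out, the update on each iteration looks as follows (using the course's notation for the gradients dW, db and the learning rate α):

v_{dW} = \beta v_{dW} + (1-\beta) \, dW \, , \quad W = W - \alpha \, v_{dW}

v_{db} = \beta v_{db} + (1-\beta) \, db \, , \quad b = b - \alpha \, v_{db}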

        Like α, β here is also a hyperparameter of the gradient descent algorithm; you need to try different values of β and choose the best one based on the results.
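
        For reference, here is a minimal sketch of one momentum update step in NumPy; the parameter shapes and the random gradients are hypothetical placeholders for what backpropagation would produce.

import numpy as np

def momentum_step(W, b, dW, db, vW, vb, alpha=0.01, beta=0.9):
    # One momentum gradient descent update using exponentially weighted gradient averages
    vW = beta * vW + (1 - beta) * dW   # running average of the weight gradients
    vb = beta * vb + (1 - beta) * db   # running average of the bias gradients
    W = W - alpha * vW                 # update parameters with the smoothed gradients
    b = b - alpha * vb
    return W, b, vW, vb

# Usage: initialize the velocity terms to zero, then call once per iteration.
W, b = np.random.randn(3, 2), np.zeros((3, 1))
vW, vb = np.zeros_like(W), np.zeros_like(b)
dW, db = np.random.randn(*W.shape), np.random.randn(*b.shape)  # stand-ins for backprop gradients
W, b, vW, vb = momentum_step(W, b, dW, db, vW, vb)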

Origin: blog.csdn.net/sxyang2018/article/details/131915340