Improving neural networks: optimization algorithms (mini-batch, gradient descent with momentum, the Adam optimization algorithm)

Optimization algorithms can make neural networks train faster. Applied machine learning is a highly empirical, iterative process: you have to train many models before finding the right one. Fast optimization algorithms therefore help you train models quickly.

One difficulty is that deep learning has not yet realized its full potential on big data: we can train a neural network on a huge data set, but training on a huge data set is slow. You will therefore find that fast, easy-to-use optimization algorithms can greatly improve your team's efficiency.

Gradient descent

In machine learning, the simplest form is gradient descent (GD) without any optimizations: we process the entire training set in each cycle, which is called batch gradient descent. See the article Machine Learning: Gradient descent algorithm.

When using the gradient descent algorithm on $m$ samples, stack the training samples into a huge matrix $X = [x^{(1)}\; x^{(2)}\; x^{(3)}\; \dots\; x^{(m)}]$, and likewise $Y = [y^{(1)}\; y^{(2)}\; y^{(3)}\; \dots\; y^{(m)}]$. Vectorization lets you process all $m$ samples relatively quickly, but if $m$ is large, each step is still slow. For example, if $m$ is 5 million or 50 million or more, gradient descent on the entire training set requires processing all of it before taking a single gradient step, and then processing all 5 million samples again before the next step. If you instead let gradient descent take a step after processing only a portion of the 5-million-sample training set, the algorithm runs faster.

Mini-batch Gradient Descent

Divide the training set into smaller subsets for training; these subsets are called mini-batches.

Gradient descent evolved into the stochastic gradient descent (SGD) algorithm and the mini-batch gradient descent algorithm. Stochastic gradient descent is mini-batch gradient descent with a mini-batch size of 1: unlike batch gradient descent, it computes the gradient on a single training sample at a time rather than over the entire training set.

In stochastic gradient descent, only one training sample is used before each gradient update. When the training set is large, stochastic gradient descent can be faster, but the parameters oscillate toward the minimum rather than converging smoothly. Let's look at the comparison diagram:

[Figure: contour plot comparing the noisy path of stochastic gradient descent with the smoother path of batch gradient descent]

In practice, a better approach is mini-batch gradient descent, a method that combines batch gradient descent and stochastic gradient descent. In each iteration it learns neither from all the data nor from a single sample: it divides the data set into small blocks and learns from one randomly selected block (mini-batch) at a time, with a block size that is generally a power of 2.

Assume each subset holds only 1000 samples. Take $x^{(1)}$ through $x^{(1000)}$ and call them the first sub-training set, also called a mini-batch; then take the next 1000 samples, $x^{(1001)}$ through $x^{(2000)}$; then another 1000 samples, and so on.

Assume the number of training samples is $m = 5{,}000{,}000$ and each mini-batch has 1000 samples, so there are 5000 mini-batches. Call $x^{(1)}$ through $x^{(1000)}$ $X^{\{1\}}$, $x^{(1001)}$ through $x^{(2000)}$ $X^{\{2\}}$, and so on up to $X^{\{5000\}}$; $Y$ undergoes the same process. Mini-batch number $t$ consists of $X^{\{t\}}$ and $Y^{\{t\}}$. The dimension of $X^{\{t\}}$ is $(n_x, 1000)$, where $n_x$ is the number of input units per sample, and the dimension of $Y^{\{t\}}$ is $(1, 1000)$.
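As a quick sanity check of these shapes, here is a minimal NumPy sketch (not from the original post) that partitions a training set into mini-batches; the sizes are scaled down from the 5-million-sample example for illustration.

```python
import numpy as np

n_x, m, batch_size = 3, 10_000, 1000      # scaled-down illustrative sizes
X = np.random.randn(n_x, m)               # columns are the samples x^(i)
Y = np.random.randn(1, m)

# X^{t}, Y^{t}: consecutive slices of 1000 columns each
X_batches = [X[:, t:t + batch_size] for t in range(0, m, batch_size)]
Y_batches = [Y[:, t:t + batch_size] for t in range(0, m, batch_size)]

assert len(X_batches) == m // batch_size  # 10 here, 5000 in the post's example
assert X_batches[0].shape == (n_x, 1000)  # dimension (n_x, 1000)
assert Y_batches[0].shape == (1, 1000)    # dimension (1, 1000)
```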

A note on notation: we previously used superscript parentheses $(i)$ for values in the training set, so $x^{(i)}$ is the $i$-th training sample, and superscript square brackets $[l]$ for layer numbers, so $z^{[l]}$ is the $z$ value of layer $l$ of the neural network. We now introduce superscript curly braces $\{t\}$ for different mini-batches, giving $X^{\{t\}}$ and $Y^{\{t\}}$.

To run mini-batch gradient descent on the training set, you loop for $t = 1, \dots, 5000$, since we have 5000 groups of 1000 samples each. Inside the for loop, you take one gradient descent step using $X^{\{t\}}$ and $Y^{\{t\}}$.

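A minimal sketch of one such pass over the training set, assuming for concreteness a logistic-regression model; the model, learning rate, and function names are illustrative assumptions, not the post's code.

```python
import numpy as np

def minibatch_gd_epoch(X, Y, w, b, lr=0.01, batch_size=1000):
    """One epoch of mini-batch gradient descent on logistic regression
    (an illustrative model choice): one gradient step per mini-batch."""
    m = X.shape[1]
    perm = np.random.permutation(m)                # shuffle before slicing
    X, Y = X[:, perm], Y[:, perm]
    for t in range(0, m, batch_size):
        Xt, Yt = X[:, t:t + batch_size], Y[:, t:t + batch_size]  # X^{t}, Y^{t}
        mb = Xt.shape[1]
        A = 1.0 / (1.0 + np.exp(-(w.T @ Xt + b)))  # forward pass (sigmoid)
        dZ = A - Yt                                # cross-entropy gradient
        dw = Xt @ dZ.T / mb
        db = dZ.sum() / mb
        w -= lr * dw                               # one step per mini-batch
        b -= lr * db
    return w, b
```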

With batch gradient descent, every iteration processes the entire training set, and the cost should go down on every iteration. So if you plot the cost function $J$ against the number of iterations, it should decrease monotonically; if $J$ ever increases on some iteration, something is definitely wrong, perhaps a learning rate that is too high.

With mini-batch gradient descent, a plot of the cost over the whole run does not decrease on every iteration. Each iteration processes a different $X^{\{t\}}$, $Y^{\{t\}}$, so the cost $J^{\{t\}}$ depends only on that mini-batch. The plot of the cost still trends downward, but with more noise, because every iteration trains on a different sample set, that is, a different mini-batch.

When using mini-batch gradient descent, one of the variables you need to decide is the mini-batch size. Let $m$ be the size of the training set. In the extreme case where the mini-batch size equals $m$, you have batch gradient descent; if the mini-batch size is 1, you have stochastic gradient descent.

In stochastic gradient descent, starting from a fresh initial point, each iteration takes a gradient step on only one sample. Most steps move toward the global minimum, but some move away from it, because that particular sample happens to point in the wrong direction, so stochastic gradient descent is very noisy. On average it ends up near the minimum, but it never converges: it keeps fluctuating around the minimum rather than reaching it and staying there.

In addition, a major disadvantage of stochastic gradient descent is that you lose all the speedup from vectorization, since you process only one training sample at a time, which is too inefficient. In practice it is best to choose a mini-batch size somewhere in between, which actually learns fastest. On one hand you get the benefit of vectorization: with a mini-batch of 1000 samples, processing all 1000 at once is much faster than processing them one by one. On the other hand, you do not have to wait for the entire training set to be processed before making progress.

Exponentially weighted averages

The exponentially weighted average, also known as the exponentially weighted moving average, is, as the name suggests, an averaging method. It will be used later when we introduce momentum, RMSprop, and Adam.

Calculation
The exponentially weighted average is computed as:
$v_t = \beta v_{t-1} + (1 - \beta)\theta_t$
where $\beta$ is a coefficient, usually 0.9, $v_t$ is the current exponentially weighted average, and $\theta_t$ is the current value. Let's walk through an example.
[Figure: scatter plot of daily temperatures in London over one year]
The figure above is a scatter plot of the daily temperature in London, England, for one year. We use this data to compute a moving average of the temperature.
First set $v_0 = 0$. Then, each day, combine the previous value with weight 0.9 and the day's temperature with weight 0.1: $v_1 = 0.9 v_0 + 0.1\theta_1$, where $\theta_1$ is the temperature on the first day.

On the second day, take 0.9 times the previous value plus 0.1 times the day's temperature: $v_2 = 0.9 v_1 + 0.1\theta_2$, and so on.

Calculate in this way and then plot it with a red line, and you will get the result as shown in the figure below.
[Figure: red curve of the exponentially weighted average overlaid on the temperature scatter]
When you compute this, $v_t$ is approximately an average over the previous $\frac{1}{1-\beta}$ days of temperature. If $\beta$ is 0.9, $v_t$ is roughly a ten-day average, which is the red line in the figure above.

If you set $\beta$ to a value close to 1, such as 0.98, then $\frac{1}{1-0.98} = 50$, so this is roughly an average of the temperature over the past 50 days; plotting it gives the green line shown in the figure below.
[Figure: green curve for β = 0.98, smoother but shifted to the right of the red curve]
For $\beta = 0.98$, a few points are worth noting. The curve you get is flatter, because you are averaging the temperature over more days, so it fluctuates less. The disadvantage is that the curve shifts further to the right: the average now covers more values, so the exponentially weighted average adapts more slowly when the temperature changes, and there is some delay. With $\beta = 0.98$, you give a large weight (0.98) to the previous value and only 0.02 to the current day's value, so when the temperature rises or falls the average responds slowly. The larger $\beta$ is, the more slowly the exponentially weighted average adapts.

If you set $\beta$ to 0.5, you average the temperature over only about two days.
[Figure: yellow curve for β = 0.5, noisier but quicker to track temperature changes]
Since only about two days of temperatures are averaged, there is too little data in each average, so the resulting curve is noisier and more sensitive to outliers, but it adapts to temperature changes faster.

By adjusting this parameter ($\beta$ will be an important hyperparameter in the learning algorithms that follow) you get slightly different effects, and some value in between usually works best: the red curve, with an intermediate $\beta$, averages the temperature better than the green and yellow lines.

One benefit of the exponentially weighted average formula is that it takes very little memory: you keep a single number in memory, substitute the latest data point into the formula, and overwrite it. It is essentially one line of code, and computing the exponentially weighted average needs only the storage of a single number.
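A minimal sketch of this computation, run on synthetic temperature data (an illustrative assumption, not the London data):

```python
import numpy as np

def ewa(theta, beta=0.9):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * theta_t.
    Only the single running number v is stored, which is the memory benefit above."""
    v = 0.0
    out = np.empty(len(theta))
    for t, x in enumerate(theta):
        v = beta * v + (1 - beta) * x
        out[t] = v
    return out

# synthetic daily temperatures: a seasonal cycle plus noise
temps = 10 + 8 * np.sin(np.arange(365) * 2 * np.pi / 365) + np.random.randn(365)
red = ewa(temps, beta=0.9)      # ~10-day average (red curve)
green = ewa(temps, beta=0.98)   # ~50-day average (green curve), smoother but lags
yellow = ewa(temps, beta=0.5)   # ~2-day average (yellow curve), noisy
```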

Bias correction
[Figure: for β = 0.98, the uncorrected purple curve starts much lower than the green curve]
In practice, when $\beta$ equals 0.98 you do not get the green curve but a purple one, and you can see that its starting point is much lower. This is because the moving average starts with $v_0 = 0$, so $v_1 = 0.98 v_0 + 0.02\theta_1$; since $v_0 = 0$, this gives $v_1 = 0.02\theta_1$, a value far too small, so the estimate of the first day's temperature is inaccurate. Substituting into the formula for $v_2$ shows that $v_2$ likewise cannot estimate the temperature of the first days of the year well.

There is a way to modify this estimate to make it more accurate, especially during the early phase: instead of $v_t$, use $\frac{v_t}{1 - \beta^t}$.

For example, when $t = 2$, $1 - \beta^t = 1 - 0.98^2 = 0.0396$, so the estimate of the second day's temperature becomes $\frac{v_2}{0.0396} = \frac{0.0196\theta_1 + 0.02\theta_2}{0.0396}$, a weighted average of $\theta_1$ and $\theta_2$ with the bias removed. As $t$ increases, $\beta^t$ approaches 0, so for large $t$ the bias correction has almost no effect, and the purple line essentially coincides with the green line.
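Extending the earlier sketch with this correction (same illustrative setup):

```python
def ewa_corrected(theta, beta=0.98):
    """Exponentially weighted average with bias correction v_t / (1 - beta^t)."""
    v = 0.0
    out = []
    for t, x in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * x
        out.append(v / (1 - beta ** t))   # correction matters mostly while t is small
    return out
```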

Gradient descent with momentum

With plain gradient descent, more difficult cost surfaces expose a problem: if the learning rate is too large, the updates swing too widely and the error is large; if it is too small, the number of iterations grows and training takes a very long time. Neural network models run into this constantly: the solution oscillates within a small region and struggles to reach the optimum.

Gradient descent with momentum avoids these problems better. The process resembles a small ball with mass rolling down the function surface: when the ball passes the lowest point, inertia carries it upward for a while, then it rolls back, crosses the lowest point again, and eventually settles at the bottom. Because the ball has inertia, it can cross steep and complicated parts of the curve or surface and reach the lowest point as quickly as possible.

Momentum gradient descent adds some optimizations on top of the plain gradient descent algorithm. The convergence behavior of gradient descent is shown in the figure below.
[Figure: contour plot of the cost surface; gradient descent oscillates up and down in the y direction while moving from the starting point to the minimum]
The black dot is the starting point and the red dot is the minimum. Through repeated iterations, gradient descent slowly moves from the black dot to the red dot, but notice that the trajectory keeps oscillating up and down in the y direction. These oscillations do nothing to bring it closer to the red dot; they only waste time. We would like to suppress this unnecessary movement and accelerate progress along the x axis, as shown in the figure below.

[Figure: the desired trajectory, with the y-direction oscillation damped and faster movement along x]

Parameter update of the gradient descent algorithm:
$w = w - \alpha \, dw$
$b = b - \alpha \, db$
Parameter update of the momentum gradient descent algorithm:
$v_{dw} = \beta v_{dw} + (1 - \beta) dw$
$v_{db} = \beta v_{db} + (1 - \beta) db$
$w = w - \alpha \, v_{dw}$
$b = b - \alpha \, v_{db}$

In the formulas above, $\alpha$ is the learning rate. Notice that momentum gradient descent does not use the current gradient alone when updating parameters; it also uses the previous gradients, by an amount controlled by $\beta$: the larger $\beta$, the more the previous gradients contribute; the smaller $\beta$, the less they contribute.

Because the gradient in the y direction alternates between positive and negative, its mean is approximately 0, so there is little net movement along y, while the gradients in the x direction are consistent, so the accumulated momentum accelerates the updates along x. And because momentum uses the running mean of previous gradients in addition to the current gradient, when the parameters fall into a local minimum (where the gradient is close to 0), the mean of previous gradients can still carry them out of it, whereas plain gradient descent would simply get stuck there.
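A minimal sketch of this update rule (the function name and state dictionary are illustrative assumptions):

```python
import numpy as np

def momentum_update(w, b, dw, db, state, lr=0.01, beta=0.9):
    """One step of gradient descent with momentum; state carries v_dw, v_db."""
    state["v_dw"] = beta * state["v_dw"] + (1 - beta) * dw
    state["v_db"] = beta * state["v_db"] + (1 - beta) * db
    w -= lr * state["v_dw"]      # step with the velocity, not the raw gradient
    b -= lr * state["v_db"]
    return w, b

# usage: initialize the velocities to zero, matching the parameter shapes
state = {"v_dw": np.zeros((3, 1)), "v_db": 0.0}
```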

RMSprop algorithm

RMSprop stands for root mean square prop. Its goal is the same as momentum's: reduce the jitter in the y direction and lengthen the stride in the x direction; only the implementation differs slightly. Momentum accelerates mainly by accumulating previous gradients. RMSprop instead exploits the fact that, for plain gradient descent, the gradient is relatively large in the y direction and relatively small in the x direction. When updating parameters, it divides the y-direction gradient ($db$) by a large number, so the y update is small, and divides the x-direction gradient ($dw$) by a small number, so the x update is larger. This shrinks the step size along y and grows it along x, letting the algorithm converge faster. The update formulas are as follows:

$S_{dw} = \beta S_{dw} + (1 - \beta) dw^2$
$S_{db} = \beta S_{db} + (1 - \beta) db^2$
$w = w - \alpha \, \frac{dw}{\sqrt{S_{dw}} + \varepsilon}$
$b = b - \alpha \, \frac{db}{\sqrt{S_{db}} + \varepsilon}$

To avoid a zero denominator in the update, a very small number $\varepsilon$ is added, usually $10^{-8}$. Here $dw^2$ is the element-wise square of the gradient of $w$, also called the square of the derivative.

Calling the vertical axis $b$ and the horizontal axis $w$ is just for convenience of presentation. In practice the parameters live in a high-dimensional space, and the oscillating vertical dimensions you need to damp might be parameters $w_1, w_2$, and so on, while the horizontal dimensions might be $w_3, w_4$, and so on; the separation into $b$ and $w$ is only for convenience. The actual $dw$ is a high-dimensional parameter vector, and so is $db$.
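A minimal sketch of the RMSprop update (the function name, state layout, and the decay value 0.9 are illustrative assumptions):

```python
import numpy as np

def rmsprop_update(w, b, dw, db, state, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSprop step; state carries the running squared-gradient averages."""
    state["s_dw"] = beta * state["s_dw"] + (1 - beta) * dw ** 2
    state["s_db"] = beta * state["s_db"] + (1 - beta) * db ** 2
    w -= lr * dw / (np.sqrt(state["s_dw"]) + eps)   # large s => damped step
    b -= lr * db / (np.sqrt(state["s_db"]) + eps)   # small s => larger step
    return w, b
```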

Adam algorithm

Adam stands for Adaptive Moment Estimation; it essentially combines the momentum algorithm and the RMSprop algorithm. The formulas are as follows:

Initialize $v_{dw} = 0,\ v_{db} = 0,\ S_{dw} = 0,\ S_{db} = 0$, then on each iteration:
$v_{dw} = \beta_1 v_{dw} + (1 - \beta_1) dw$
$v_{db} = \beta_1 v_{db} + (1 - \beta_1) db$
$S_{dw} = \beta_2 S_{dw} + (1 - \beta_2) dw^2$
$S_{db} = \beta_2 S_{db} + (1 - \beta_2) db^2$

Parameter update:

$w = w - \alpha \, \frac{v_{dw}}{\sqrt{S_{dw}} + \varepsilon}$
$b = b - \alpha \, \frac{v_{db}}{\sqrt{S_{db}} + \varepsilon}$

As with any exponentially weighted average, the initialization bias can be large; it can be corrected as follows:

$v_{dw}^{corrected} = \frac{v_{dw}}{1 - \beta_1^t} \qquad v_{db}^{corrected} = \frac{v_{db}}{1 - \beta_1^t}$

$S_{dw}^{corrected} = \frac{S_{dw}}{1 - \beta_2^t} \qquad S_{db}^{corrected} = \frac{S_{db}}{1 - \beta_2^t}$

$w = w - \alpha \, \frac{v_{dw}^{corrected}}{\sqrt{S_{dw}^{corrected}} + \varepsilon} \qquad b = b - \alpha \, \frac{v_{db}^{corrected}}{\sqrt{S_{db}^{corrected}} + \varepsilon}$

In the formulas above, $t$ is the iteration number; $\beta_1$ is the momentum hyperparameter, usually 0.9; $\beta_2$ is the RMSprop hyperparameter, usually 0.999; and $\varepsilon$, which avoids a zero denominator, is usually $10^{-8}$.
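Putting it together, a minimal sketch of one Adam step with bias correction (the function name and state layout are illustrative assumptions):

```python
import numpy as np

def adam_update(w, b, dw, db, state, t, lr=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; t is the iteration count starting at 1."""
    state["v_dw"] = beta1 * state["v_dw"] + (1 - beta1) * dw
    state["v_db"] = beta1 * state["v_db"] + (1 - beta1) * db
    state["s_dw"] = beta2 * state["s_dw"] + (1 - beta2) * dw ** 2
    state["s_db"] = beta2 * state["s_db"] + (1 - beta2) * db ** 2
    # bias-corrected first and second moment estimates
    v_dw = state["v_dw"] / (1 - beta1 ** t)
    v_db = state["v_db"] / (1 - beta1 ** t)
    s_dw = state["s_dw"] / (1 - beta2 ** t)
    s_db = state["s_db"] / (1 - beta2 ** t)
    w -= lr * v_dw / (np.sqrt(s_dw) + eps)
    b -= lr * v_db / (np.sqrt(s_db) + eps)
    return w, b
```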

Reference article
https://blog.csdn.net/weixin_36815313/article/details/105432576
https://blog.csdn.net/sinat_29957455/article/details/88088720
