Gradient Descent, Stochastic Gradient Descent, and Batch Gradient Descent

4. The big family of gradient descent methods (BGD, SGD, MBGD)

4.1 Batch Gradient Descent

    The batch gradient descent method is the most commonly used form of gradient descent. Its approach is to use all of the samples to update the parameters. This corresponds to the gradient descent algorithm for linear regression in Section 3.3.1 above; in other words, the algorithm in 3.3.1 is batch gradient descent.

    $$\theta_i = \theta_i - \alpha \sum_{j=0}^{m} \left( h_\theta(x_0^{(j)}, x_1^{(j)}, \dots, x_n^{(j)}) - y_j \right) x_i^{(j)}$$

    Since we have m samples, the gradient here is computed using the data of all m samples.
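
    As a rough illustration, here is a minimal NumPy sketch of this update (the function name and the hyperparameters alpha and n_iters are illustrative, not from the text). It assumes X already contains a column of ones for the intercept term and, like the formula above, sums the per-sample gradients over all m samples without averaging.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for linear regression.

    X: (m, n) feature matrix whose first column is all ones (bias term).
    y: (m,) target vector.
    Returns the fitted parameter vector theta of shape (n,).
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        # Prediction error for all m samples: h_theta(x^(j)) - y_j
        errors = X @ theta - y
        # The gradient uses every sample, matching the summation in the update formula
        gradient = X.T @ errors
        theta -= alpha * gradient
    return theta
```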

4.2 Stochastic Gradient Descent

    The stochastic gradient descent method is similar in principle to batch gradient descent. The difference is that it does not use the data of all m samples when computing the gradient, but selects only a single sample j. The corresponding update formula is:

    $$\theta_i = \theta_i - \alpha \left( h_\theta(x_0^{(j)}, x_1^{(j)}, \dots, x_n^{(j)}) - y_j \right) x_i^{(j)}$$
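
    A minimal sketch of this single-sample update, under the same assumptions as the batch version above (illustrative names; X includes a bias column). Each parameter update uses exactly one sample j, and the samples are visited in a shuffled order each epoch.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=50, seed=0):
    """Stochastic gradient descent: each update uses a single sample j."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        for j in rng.permutation(m):         # visit samples in random order
            error = X[j] @ theta - y[j]      # h_theta(x^(j)) - y_j
            theta -= alpha * error * X[j]    # update from this one sample only
    return theta
```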

    Stochastic gradient descent and the batch gradient descent of 4.1 are two extremes: one uses all of the data for each gradient step, the other uses a single sample. The advantages and disadvantages of each are therefore very pronounced. In terms of training speed, stochastic gradient descent iterates with only one sample at a time and is very fast, whereas batch gradient descent becomes too slow when the sample size is large. In terms of accuracy, stochastic gradient descent determines the gradient direction from a single sample, so the resulting solution is often not optimal. In terms of convergence, because stochastic gradient descent iterates on one sample at a time, the update direction varies greatly from step to step, and it cannot quickly converge to the local optimum.

    So, is there a middle-of-the-road approach that combines the advantages of both? There is: the mini-batch gradient descent method of Section 4.3.

4.3 Mini-batch Gradient Descent

    The mini-batch gradient descent method is a compromise between batch gradient descent and stochastic gradient descent: for m samples, we use x of them per iteration, with 1 < x < m. Generally x = 10 works well, although x can of course be tuned to the data. The corresponding update formula is:

    $$\theta_i = \theta_i - \alpha \sum_{j=t}^{t+x-1} \left( h_\theta(x_0^{(j)}, x_1^{(j)}, \dots, x_n^{(j)}) - y_j \right) x_i^{(j)}$$
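
    A minimal sketch of the mini-batch update, again under the same assumptions (illustrative names; X includes a bias column). Each update sums the gradients of batch_size = x samples taken from a shuffled ordering of the data.

```python
import numpy as np

def mini_batch_gradient_descent(X, y, alpha=0.01, batch_size=10, n_epochs=50, seed=0):
    """Mini-batch gradient descent: each update uses batch_size samples."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        order = rng.permutation(m)
        for start in range(0, m, batch_size):
            batch = order[start:start + batch_size]    # samples j = t, ..., t+x-1
            errors = X[batch] @ theta - y[batch]
            theta -= alpha * (X[batch].T @ errors)     # sum of per-sample gradients in the batch
    return theta
```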

 

Note: Passing in a single sample at a time and traversing the entire data set in this way still counts as stochastic gradient descent.
