4. The big family of gradient descent methods (BGD, SGD, MBGD)
4.1 Batch Gradient Descent
Batch gradient descent is the most commonly used form of gradient descent. It updates the parameters using all of the samples. This is the method that corresponds to the gradient descent algorithm for linear regression in section 3.3.1 above; in other words, the gradient descent algorithm of 3.3.1 is batch gradient descent.
$$\theta_i = \theta_i - \alpha\sum_{j=0}^{m}\big(h_\theta(x_0^{(j)}, x_1^{(j)}, \dots, x_n^{(j)}) - y_j\big)\,x_i^{(j)}$$
Since we have m samples, the gradient computed at each update here uses all m of them.
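The update rule above can be sketched in NumPy as follows. This is a minimal illustration for linear regression; the function name, the learning rate, and the toy dataset are my own choices, not from the text.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Linear-regression fit where every step uses ALL m samples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        # errors[j] = h_theta(x^(j)) - y_j for every sample j
        errors = X @ theta - y
        # Gradient is the sum over all m samples; dividing by m
        # just rescales the learning rate and keeps steps stable.
        theta -= lr * (X.T @ errors) / m
    return theta

# Toy data generated from y = 2*x0 + 3*x1 (an assumed example).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, 3.0])
theta = batch_gradient_descent(X, y)
```

Because every iteration touches the full dataset, each step follows the exact gradient of the total cost, at the price of one full pass over the data per update.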
4.2 Stochastic Gradient Descent
Stochastic gradient descent is similar in spirit to batch gradient descent. The difference is that it does not use all m samples when computing the gradient, but selects a single sample j. The corresponding update formula is:
$$\theta_i = \theta_i - \alpha\big(h_\theta(x_0^{(j)}, x_1^{(j)}, \dots, x_n^{(j)}) - y_j\big)\,x_i^{(j)}$$
Stochastic gradient descent and the batch gradient descent of 4.1 are two extremes: one uses all the data for each step, the other uses a single sample. Their strengths and weaknesses are correspondingly stark. In terms of training speed, stochastic gradient descent iterates on only one sample at a time and is therefore very fast, while batch gradient descent becomes prohibitively slow when the sample size is large. In terms of accuracy, stochastic gradient descent determines the gradient direction from a single sample, so the resulting solution is often suboptimal. In terms of convergence, because each stochastic step is driven by one sample, the iteration direction fluctuates heavily, and it cannot converge quickly to a local optimum.
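The single-sample update above can be sketched as follows. This is a minimal NumPy illustration; the shuffling-per-epoch loop, function name, and toy data are my own assumptions, not prescribed by the text.

```python
import numpy as np

def stochastic_gradient_descent(X, y, lr=0.05, n_epochs=200, seed=0):
    """Linear-regression fit where each step uses ONE sample j."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        # Visit the samples in a fresh random order each epoch.
        for j in rng.permutation(m):
            # Gradient from a single sample: (h_theta(x^(j)) - y_j) * x^(j)
            error = X[j] @ theta - y[j]
            theta -= lr * error * X[j]
    return theta

# Same toy data, y = 2*x0 + 3*x1 (an assumed example).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, 3.0])
theta_sgd = stochastic_gradient_descent(X, y)
```

Each update is cheap, but the parameter trajectory jitters from sample to sample, which is exactly the noisy convergence behavior described above.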
So, is there a middle ground that combines the best of both worlds? There is: the mini-batch gradient descent method of 4.3.
4.3 Mini-batch Gradient Descent
Mini-batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent: out of m samples, each iteration uses x of them, where 1 < x < m. Generally x = 10 is a reasonable choice, and x can of course be tuned to the data. The corresponding update formula is:
$$\theta_i = \theta_i - \alpha\sum_{j=t}^{t+x-1}\big(h_\theta(x_0^{(j)}, x_1^{(j)}, \dots, x_n^{(j)}) - y_j\big)\,x_i^{(j)}$$
Note: an implementation that passes in a single sample at a time while traversing the entire dataset (i.e., x = 1) is still stochastic gradient descent.
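A mini-batch version of the update can be sketched as follows. As before this is a minimal NumPy illustration under my own assumptions (function name, batch size, learning rate, toy data); with batch_size = 1 it reduces to stochastic gradient descent, and with batch_size = m to batch gradient descent.

```python
import numpy as np

def minibatch_gradient_descent(X, y, batch_size=2, lr=0.1,
                               n_epochs=300, seed=0):
    """Linear-regression fit where each step uses x = batch_size samples."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        idx = rng.permutation(m)           # shuffle once per epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            # Gradient summed over just the x samples in this batch,
            # averaged so the step size is independent of batch size.
            errors = X[batch] @ theta - y[batch]
            theta -= lr * (X[batch].T @ errors) / len(batch)
    return theta

# Same toy data, y = 2*x0 + 3*x1 (an assumed example).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, 3.0])
theta_mb = minibatch_gradient_descent(X, y)
```

The batch average smooths out much of the single-sample noise while each step still costs only x gradient evaluations, which is the compromise this section describes.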