Batch Gradient Descent

1. Development history

The gradient descent method is an optimization algorithm with a long history: it dates back to Cauchy in 1847 and was widely adopted in machine learning from the mid-20th century onward. Initially, gradient descent was used to solve the linear regression problem, that is, given a data set, find the best straight line to fit the data. Over time, gradient descent has been widely applied to machine learning problems such as classification, clustering, and dimensionality reduction.

Batch gradient descent (Batch Gradient Descent) is the earliest and most basic form of the gradient descent method. Its basic idea is to update the model parameters using the gradient computed over all training samples in each iteration, so as to minimize the loss function.

With advances in computer hardware, the computation speed of batch gradient descent has improved greatly, but on large-scale data each iteration still requires a full pass over the entire data set, which makes training slow. For this reason, people began to study more efficient gradient descent variants, such as Stochastic Gradient Descent (SGD) and Mini-batch Gradient Descent.

2. Basic principles

  1. Gradient Descent

The gradient descent method is an iterative algorithm. The basic idea is to minimize the loss function by calculating the gradient of the current point and updating the parameters in the opposite direction of the gradient in each iteration. This process can be described as the following formula:

θ = θ − α∇J(θ)

Among them, θ represents the model parameters, α represents the learning rate, and ∇J(θ) represents the gradient of the loss function J(θ) with respect to the parameters θ.
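The update rule above can be sketched in a few lines. The quadratic loss J(θ) = θ², its gradient 2θ, and all numeric values here are illustrative choices, not from the text:

```python
# Minimal sketch of the update rule: theta = theta - alpha * grad_J(theta).

def gradient_descent(grad_J, theta, alpha=0.1, iterations=100):
    """Repeatedly step theta against the gradient of the loss J."""
    for _ in range(iterations):
        theta = theta - alpha * grad_J(theta)
    return theta

# J(theta) = theta^2 has gradient 2*theta and its minimum at theta = 0.
theta_min = gradient_descent(lambda t: 2 * t, theta=5.0)
print(theta_min)  # very close to 0, the minimizer of J
```

Each iteration multiplies θ by (1 − 2α) here, so with α = 0.1 the iterate shrinks toward 0 geometrically.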

  2. Batch Gradient Descent

The batch gradient descent method is a form of the gradient descent method. The basic principle is that in each iteration, the gradient of all samples is calculated and the parameters are updated in the opposite direction of the gradient. This process can be described as the following formula:

θ = θ − (α/m) · ∑_{i=1}^{m} ∇Jᵢ(θ)

Among them, m represents the number of samples, and ∇Jᵢ(θ) represents the gradient of the loss on the i-th sample, Jᵢ(θ), with respect to the parameters θ.

The advantage of batch gradient descent is that each update uses the exact gradient over the whole data set, so the iterates are stable and converge in relatively few steps; its disadvantage is that computing the gradient of all samples is expensive on large-scale data, resulting in long training times.
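The full-batch update can be sketched for linear regression with an MSE loss. The synthetic data (y = 2x + 1), hyperparameters, and variable names below are invented for illustration, not taken from the text:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, iterations=500):
    """Fit linear parameters theta by averaging the MSE gradient over ALL m samples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        # Gradient of J(theta) = (1/2m) * sum_i (x_i @ theta - y_i)^2,
        # averaged over the full batch of m samples.
        grad = X.T @ (X @ theta - y) / m
        theta -= alpha * grad
    return theta

# Synthetic data generated from y = 2*x + 1; first column is the bias term.
X = np.array([[1.0, x] for x in [0.0, 1.0, 2.0, 3.0, 4.0]])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
theta = batch_gradient_descent(X, y)
print(np.round(theta, 3))  # close to [1, 2], the intercept and slope
```

Note that every iteration touches all m rows of X, which is exactly the cost that motivates the stochastic and mini-batch variants mentioned earlier.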

3. Mathematical thinking

The mathematical idea behind batch gradient descent is to use gradient information to update the model parameters so as to minimize the loss function. Specifically, in each iteration the gradients of all samples are computed and averaged, and the parameters are then moved in the opposite direction of this gradient, so that the value of the loss function decreases steadily until convergence.

4. Mathematical principles

The mathematical principle of the batch gradient descent method can be explained from the following aspects.

  1. The negative gradient direction is the direction in which the function value decreases fastest

For a multivariate function, the gradient is a vector describing the rate of change of the function at the current point. From calculus, the gradient points in the direction of fastest increase, so the negative gradient direction is the direction in which the function value decreases fastest. In optimization problems, we can therefore step along the negative gradient to minimize the loss function.
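This property can be checked numerically. The function f(x, y) = x² + y², the point (3, 4), the step length, and the sampled comparison directions below are illustrative choices, not from the text:

```python
import math

def f(x, y):
    return x ** 2 + y ** 2

x0, y0 = 3.0, 4.0
grad = (2 * x0, 2 * y0)          # gradient of f at (x0, y0)
norm = math.hypot(*grad)
step = 0.01

# Value after a step of length `step` along the NEGATIVE gradient direction.
best = f(x0 - step * grad[0] / norm, y0 - step * grad[1] / norm)

# Equal-length steps in a few other directions never decrease f by more.
others = [f(x0 + step * math.cos(a), y0 + step * math.sin(a))
          for a in [0.0, 1.0, 2.0, 4.0]]
print(all(best <= o for o in others))  # True
```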

  2. The choice of learning rate affects the convergence speed and stability of the algorithm

The learning rate is an important parameter in batch gradient descent: it controls the step size of the parameter update in each iteration. If the learning rate is too large, the parameters oscillate back and forth during the update and may even fail to converge; if it is too small, convergence is slow and many more iterations are needed to reach the minimum. The choice of learning rate therefore has a great influence on the convergence speed and stability of the algorithm.
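The three regimes can be seen on a toy problem. The quadratic J(θ) = θ² and the specific learning rates below are illustrative values, not from the text:

```python
# Run gradient descent on J(theta) = theta**2 (gradient 2*theta)
# with different learning rates to observe their effect.

def run(alpha, steps=50, theta=1.0):
    for _ in range(steps):
        theta -= alpha * 2 * theta
    return theta

print(run(0.01))   # too small: after 50 steps, still far from the minimum at 0
print(run(0.4))    # well chosen: essentially converged to 0
print(run(1.1))    # too large: |theta| grows each step, the iteration diverges
```

Here each step multiplies θ by (1 − 2α), so α = 0.01 contracts slowly, α = 0.4 contracts quickly, and α = 1.1 gives |1 − 2α| > 1 and blows up.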

  3. The choice of loss function has a great influence on the effect of the algorithm

The loss function is an important concept in batch gradient descent: it measures the gap between the model's predictions and the true values. Different problems call for different loss functions. For example, the loss commonly used in regression problems is Mean Squared Error (MSE), and the loss commonly used in classification problems is Cross-Entropy.

5. Representative literature

Batch gradient descent is one of the earliest optimization algorithms used in machine learning, is widely applied, and has an extensive related literature. The following are some representative works:

  1. Rosenblatt F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 1958, 65(6): 386.

  2. Rumelhart D E, Hinton G E, Williams R J. Learning representations by back-propagating errors. Nature, 1986, 323(6088): 533-536.

  3. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553): 436-444.

  4. Bottou L. Online learning and stochastic approximations. Online Learning and Neural Networks, 1998, 17(9): 1429-1445.

  5. Kingma D P, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

  6. Zeiler M D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

  7. Tieleman T, Hinton G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

  8. Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011, 12(Jul): 2121-2159.

  9. Shalev-Shwartz S, Tewari A. Stochastic methods for l1-regularized loss minimization. Journal of Machine Learning Research, 2011, 12(Mar): 1865-1892.


Origin blog.csdn.net/qq_16032927/article/details/129444549