Gradient descent and its Python implementation

The gradient descent method, also known as the steepest descent method, is the most commonly used method for solving unconstrained optimization problems. It is an iterative method: the main operation of each step is to compute the gradient vector of the objective function at the current position and take the negative gradient direction as the search direction (the objective function decreases fastest in this direction, which is also where the name "steepest descent" comes from).
A characteristic of the gradient descent method: the closer it gets to the target value, the smaller the steps become and the slower the descent.
Intuitively, this is shown in the following figure:

[Figure: contour plot of the objective function; the iterates step inward ring by ring toward the center.]

Here each circle represents a contour line of the function, and the center represents the function's extremum. Each iteration finds a new position from the gradient at the current position (which determines the search direction and, together with the step size, the distance moved), so that successive iterations finally reach a local optimum of the objective function (or the global optimum, if the objective function is convex).
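To make the iteration concrete before turning to formulas, here is a minimal one-variable sketch (entirely illustrative: the function f(x) = (x - 3)², the starting point, and the step size are our own assumptions, not from the original post) that walks downhill by repeatedly stepping along the negative gradient:

[python]
# Minimal sketch: gradient descent on the convex function f(x) = (x - 3)**2.
def grad_f(x):
    return 2 * (x - 3)       # f'(x) for f(x) = (x - 3)**2

x = 0.0                      # starting position (assumed)
alpha = 0.1                  # step size (assumed)
for step in range(50):
    x -= alpha * grad_f(x)   # move along the negative gradient

print(x)                     # close to the minimizer x = 3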


Next, we explain the gradient descent method with formulas.
The following h(θ) is our fitting function:

$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \sum_{j=0}^{n} \theta_j x_j \quad (x_0 = 1)$$
It can also be expressed in vector form:

$$h_\theta(x) = \theta^{\mathrm{T}} x$$
The following function is the risk function we need to optimize. Each term $h_\theta(x^{(i)}) - y^{(i)}$ is the residual between our fitting function and y on the existing training set; summing its squared loss over the m training samples gives the risk function we build (see "least squares and its Python implementation"):

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
Here we multiply by 1/2 to make the result more concise when taking partial derivatives later; we may do so because multiplying by this constant coefficient has no effect on where the risk function attains its optimum.
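To see the cancellation explicitly, differentiate a single squared-residual term with the chain rule; the 2 from the power rule cancels the 1/2:

$$\frac{\partial}{\partial \theta_j}\,\frac{1}{2}\left(h_\theta(x) - y\right)^2 = \left(h_\theta(x) - y\right)\cdot\frac{\partial h_\theta(x)}{\partial \theta_j} = \left(h_\theta(x) - y\right) x_j$$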
Our goal is to minimize the risk function, so that our fitting function fits the target y as well as possible, that is,

$$\min_{\theta} J(\theta)$$

The specific gradient derivations that follow are all based on this objective.
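As a quick concrete check (a sketch of our own; the helper names h and J are hypothetical, and the data is the training set used by the implementation later in this post), the fitting function and the risk function translate directly into Python:

[python]
# Sketch: evaluating the fitting function h(theta) and the risk function J(theta).
x = [(1, 0., 3), (1, 1., 3), (1, 2., 3), (1, 3., 2), (1, 4., 4)]
y = [95.364, 97.217205, 75.195834, 60.105519, 49.342380]

def h(theta, xi):
    # h(theta) = theta0 * x0 + theta1 * x1 + theta2 * x2, with x0 fixed at 1
    return sum(t * xj for t, xj in zip(theta, xi))

def J(theta):
    # J(theta) = 1/2 * sum over the m samples of the squared residual
    return 0.5 * sum((h(theta, xi) - yi) ** 2 for xi, yi in zip(x, y))

print(J([0, 0, 0]))   # loss of the all-zero initialization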


Batch gradient descent (BGD)

According to the traditional idea, we take the partial derivative of the risk function above with respect to each parameter θ_j and obtain the corresponding gradient component:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Here $x_j^{(i)}$ denotes the jth component of the ith sample point, that is, the input that multiplies θ_j in h(θ). Since we want to minimize the risk function, we update each parameter along its negative gradient direction:

$$\theta_j := \theta_j - \alpha \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Here α is the step size of each update. Notice from the formula that this method obtains a global optimal solution (our risk function is convex), but every single iteration must use all the data in the training set; if m is very large, you can imagine how slow the iterations become! A sketch of one batch run follows; this is what motivates another method, stochastic gradient descent.
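The update rule above translates into the following sketch of a full batch run on the training set used later in this post (the step size and iteration count are our own assumptions, not the original author's choices):

[python]
# Sketch of batch gradient descent: each update sums the gradient
# contribution of ALL m training samples before moving theta.
x = [(1, 0., 3), (1, 1., 3), (1, 2., 3), (1, 3., 2), (1, 4., 4)]
y = [95.364, 97.217205, 75.195834, 60.105519, 49.342380]

alpha = 0.01                 # step size (assumed)
theta = [0.0, 0.0, 0.0]

for it in range(10000):
    grad = [0.0, 0.0, 0.0]
    for xi, yi in zip(x, y):
        residual = sum(t * xj for t, xj in zip(theta, xi)) - yi   # h(x_i) - y_i
        for j in range(3):
            grad[j] += residual * xi[j]     # sum_i (h(x_i) - y_i) * x_i[j]
    theta = [theta[j] - alpha * grad[j] for j in range(3)]   # negative gradient step

print(theta)   # approaches the least-squares solution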
Stochastic gradient descent (SGD)
Because batch gradient descent iterates very slowly when the training set is large, it is not feasible to use it to optimize the risk function in that case. For this situation, stochastic gradient descent (SGD) was proposed.
We rewrite the risk function above in the following form:

$$J(\theta) = \sum_{i=1}^{m} \mathrm{cost}\left(\theta, (x^{(i)}, y^{(i)})\right)$$

where

$$\mathrm{cost}\left(\theta, (x^{(i)}, y^{(i)})\right) = \frac{1}{2} \left( y^{(i)} - h_\theta(x^{(i)}) \right)^2$$

is called the loss function of the ith sample point.
Next, we take the loss function of each sample, find its partial derivative with respect to each θ_j, and obtain the corresponding gradient; we then update each parameter along its negative gradient direction, one sample at a time:

$$\theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Compared with batch gradient descent, stochastic gradient descent uses only one sample per iteration. When the sample size is large, it is common to reach an (approximately) optimal θ using only a part of the sample data, so stochastic gradient descent costs significantly less computation than batch gradient descent.

A disadvantage of SGD is that its updates are noisier than BGD's, so an SGD iteration does not always move toward the overall optimum. Moreover, because SGD iterates on one sample at a time, the solution it finds is often not the global optimum but only a local one. Still, the overall direction of movement is toward the global optimum, and the final result usually lands near it.

The following is a graphical display of the two methods:

[Figure: optimization paths of the two methods on a contour plot; BGD descends smoothly toward the center while SGD takes a noisy, wandering path.]

As can be seen from the graph, SGD searches with a single sample point per gradient step, so its optimization path looks rather blind (this is also where the "stochastic" in the name comes from).

The advantages and disadvantages are as follows:

Batch gradient descent:
Advantages: global optimal solution; easy to parallelize; few iterations overall.
Disadvantages: when the number of samples is large, training is very slow and each iteration takes a long time.

Stochastic gradient descent:
Advantages: fast training; little computation per iteration.
Disadvantages: reduced accuracy, not globally optimal; not easy to parallelize; relatively many iterations overall.
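For contrast with the batch sketch earlier, a stochastic update touches only one sample at a time. The following is a minimal sketch of this rule (the random sampling order and the iteration count are our own assumptions; the full program in the next section does essentially the same thing, but sweeps the samples in order instead of sampling at random):

[python]
# Sketch of stochastic gradient descent: each update uses a single sample.
import random

x = [(1, 0., 3), (1, 1., 3), (1, 2., 3), (1, 3., 2), (1, 4., 4)]
y = [95.364, 97.217205, 75.195834, 60.105519, 49.342380]

alpha = 0.01                 # step size (assumed)
theta = [0.0, 0.0, 0.0]

for it in range(10000):
    i = random.randrange(len(x))                                  # pick one sample at random
    residual = sum(t * xj for t, xj in zip(theta, x[i])) - y[i]   # h(x_i) - y_i
    # theta_j := theta_j - alpha * (h(x_i) - y_i) * x_i[j], for each j
    theta = [theta[j] - alpha * residual * x[i][j] for j in range(3)]

print(theta)   # wanders into a neighborhood of the solution found below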



=========== Divider =============
Above we explained what gradient descent is and how to derive its updates. Next, we implement the gradient descent method in Python.

[python]
# _*_ coding: utf-8 _*_
# Author: yhao
# Blog: http://blog.csdn.net/yhao2014
# Email: [email protected]

# Training set
# Each sample point has 3 components (x0, x1, x2)
x = [(1, 0., 3), (1, 1., 3), (1, 2., 3), (1, 3., 2), (1, 4., 4)]
# y[i] is the output corresponding to sample point i
y = [95.364, 97.217205, 75.195834, 60.105519, 49.342380]

# Iteration threshold: stop when the loss changes by less than this
# amount between two consecutive iterations
epsilon = 0.0001

# learning rate
alpha = 0.01
diff = [0, 0]
max_itor = 1000  # maximum iteration count (declared as a safety cap; not used below)
error1 = 0
error0 = 0
cnt = 0
m = len(x)

# Initialize parameters
theta0 = 0
theta1 = 0
theta2 = 0

while True:
    cnt += 1

    # Parameter update: one stochastic step per sample, sweeping the samples in order
    for i in range(m):
        # The fitting function is y = theta0 * x[0] + theta1 * x[1] + theta2 * x[2]
        # Compute the residual for sample i
        diff[0] = (theta0 + theta1 * x[i][1] + theta2 * x[i][2]) - y[i]

        # The gradient component for theta_j is diff[0] * x[i][j]
        theta0 -= alpha * diff[0] * x[i][0]
        theta1 -= alpha * diff[0] * x[i][1]
        theta2 -= alpha * diff[0] * x[i][2]

    # Compute the loss function over the whole training set
    error1 = 0
    for lp in range(len(x)):
        error1 += (y[lp] - (theta0 + theta1 * x[lp][1] + theta2 * x[lp][2])) ** 2 / 2

    if abs(error1 - error0) < epsilon:
        break
    else:
        error0 = error1

    print('theta0 : %f, theta1 : %f, theta2 : %f, error1 : %f' % (theta0, theta1, theta2, error1))

print('Done: theta0 : %f, theta1 : %f, theta2 : %f' % (theta0, theta1, theta2))
print('Number of iterations: %d' % cnt)


Result (truncated):

[plain]
theta0 : 2.782632, theta1 : 3.207850, theta2 : 7.998823, error1 : 7.508687
theta0 : 4.254302, theta1 : 3.809652, theta2 : 11.972218, error1 : 813.550287
theta0 : 5.154766, theta1 : 3.351648, theta2 : 14.188535, error1 : 1686.507256
theta0 : 5.800348, theta1 : 2.489862, theta2 : 15.617995, error1 : 2086.492788
theta0 : 6.326710, theta1 : 1.500854, theta2 : 16.676947, error1 : 2204.562407
theta0 : 6.792409, theta1 : 0.499552, theta2 : 17.545335, error1 : 2194.779569
...
theta0 : 74.892395, theta1 : -13.494257, theta2 : 8.587471, error1 : 87.700881
theta0 : 74.942294, theta1 : -13.493667, theta2 : 8.571632, error1 : 87.372640
theta0 : 74.992087, theta1 : -13.493079, theta2 : 8.555828, error1 : 87.045719
theta0 : 75.041771, theta1 : -13.492491, theta2 : 8.540057, error1 : 86.720115
theta0 : 75.091349, theta1 : -13.491905, theta2 : 8.524321, error1 : 86.395820
theta0 : 75.140820, theta1 : -13.491320, theta2 : 8.508618, error1 : 86.072830
theta0 : 75.190184, theta1 : -13.490736, theta2 : 8.492950, error1 : 85.751139
theta0 : 75.239442, theta1 : -13.490154, theta2 : 8.477315, error1 : 85.430741
...
theta0 : 97.986390, theta1 : -13.221172, theta2 : 1.257259, error1 : 1.553781
theta0 : 97.986505, theta1 : -13.221170, theta2 : 1.257223, error1 : 1.553680
theta0 : 97.986620, theta1 : -13.221169, theta2 : 1.257186, error1 : 1.553579
theta0 : 97.986735, theta1 : -13.221167, theta2 : 1.257150, error1 : 1.553479
theta0 : 97.986849, theta1 : -13.221166, theta2 : 1.257113, error1 : 1.553379
theta0 : 97.986963, theta1 : -13.221165, theta2 : 1.257077, error1 : 1.553278
Done: theta0 : 97.987078, theta1 : -13.221163, theta2 : 1.257041
Number of iterations: 3443


You can see that the parameters finally converge to stable values.
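As a side note, the same per-sample sweep can be written more compactly with NumPy. The following is a sketch of our own, equivalent in behavior to the loop above:

[python]
# Vectorized NumPy equivalent of the implementation above: same per-sample
# updates in the same order, same stopping rule.
import numpy as np

X = np.array([(1, 0., 3), (1, 1., 3), (1, 2., 3), (1, 3., 2), (1, 4., 4)])
Y = np.array([95.364, 97.217205, 75.195834, 60.105519, 49.342380])

theta = np.zeros(3)
alpha, epsilon = 0.01, 0.0001
error0 = 0.0
while True:
    for xi, yi in zip(X, Y):
        theta -= alpha * (xi @ theta - yi) * xi    # per-sample update
    error1 = 0.5 * np.sum((Y - X @ theta) ** 2)    # loss over the whole set
    if abs(error1 - error0) < epsilon:
        break
    error0 = error1

print(theta)   # converges to the same values as the loop above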


Note: you need to choose alpha and epsilon carefully here; unsuitable values can cause the iteration to fail to converge (too large a step size makes the updates overshoot and diverge, while too large a threshold stops the iteration prematurely).


