25 Stochastic Gradient Descent Method

Stochastic gradient descent

1. Concept

  In the gradient descent method we studied earlier, at every step we computed the exact gradient of the loss function with respect to θ at the current point:
$$\nabla J(\theta) = \frac{2}{m}\begin{pmatrix}\sum_{i=1}^{m}\left(X_b^{(i)}\theta - y^{(i)}\right)\\ \sum_{i=1}^{m}\left(X_b^{(i)}\theta - y^{(i)}\right)X_1^{(i)}\\ \vdots\\ \sum_{i=1}^{m}\left(X_b^{(i)}\theta - y^{(i)}\right)X_n^{(i)}\end{pmatrix}$$
  From this formula we can see that to obtain the exact gradient, every entry must be computed over all m samples (note the summation in front of each term). A descent method of this kind is therefore usually called Batch Gradient Descent: every step processes the whole sample set in one batch. This obviously raises a problem: if m is very large, that is, if our sample size is huge, then computing the gradient itself is very time-consuming. Can we improve on this?
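  For reference, here is a minimal NumPy sketch of this batch gradient. The function name dJ and the matrix X_b (the samples with a leading column of ones) follow the notation above, but as code they are assumptions for illustration:

```python
import numpy as np

def dJ(theta, X_b, y):
    """Exact gradient of the MSE loss, computed over all m samples."""
    m = len(X_b)
    # Equivalent vectorized form of the summation formula above.
    return 2.0 / m * X_b.T.dot(X_b.dot(theta) - y)
```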

  The fix is actually very easy. Every term in the formula above is computed over all m samples, and to take the average we divide by m. A naturally derived idea follows: can we compute over just one sample at a time? Based on this idea, we can change the formula above into:
$$2\begin{pmatrix}\left(X_b^{(i)}\theta - y^{(i)}\right)\\ \left(X_b^{(i)}\theta - y^{(i)}\right)X_1^{(i)}\\ \vdots\\ \left(X_b^{(i)}\theta - y^{(i)}\right)X_n^{(i)}\end{pmatrix}$$
  Here we remove the summation symbol and always use one fixed index i. Correspondingly, we no longer divide by m outside the brackets; we just multiply by 2. Of course, we can also vectorize this formula:
$$2\left(X_b^{(i)}\right)^{T}\left(X_b^{(i)}\theta - y^{(i)}\right)$$
  The vectorization works the same way as before, except that now we operate on only a single row X_b^{(i)} of X_b each time. Can we use this formula as the direction of our search? One thing deserves special attention here: we say the direction of the search, not the direction of the gradient, because this formula is no longer the gradient of the loss function. Instead, inspired by the gradient formula, we imagine randomly picking one index i at each step and computing:
$$2\left(X_b^{(i)}\right)^{T}\left(X_b^{(i)}\theta - y^{(i)}\right)$$
  The result of this formula is also a vector, so it also expresses a direction. If we keep searching along this direction and iterating, can we reach the minimum of our loss function? The realization of this idea is called Stochastic Gradient Descent.
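  As a minimal sketch, this single-sample search direction translates into NumPy as follows (the name dJ_sgd is an assumption; X_b_i denotes one row of X_b and y_i the matching label):

```python
import numpy as np

def dJ_sgd(theta, X_b_i, y_i):
    # Direction based on one sample only; an estimate, not the true gradient.
    return 2.0 * X_b_i.T.dot(X_b_i.dot(theta) - y_i)
```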

(Figure: the search path of stochastic gradient descent, zigzagging unpredictably toward the region of the minimum.)
  The figure above shows the search process of the stochastic gradient descent method. Recall that with batch gradient descent we start from some point and move steadily toward the minimum of the loss function, following a determined direction at every step. Stochastic gradient descent, by contrast, cannot guarantee that the direction obtained at each step actually decreases the loss function, let alone that it decreases it fastest, so the search traces out a zigzag path like the one shown. That is the nature of randomness: it is unpredictable. Experiments tell us, however, that even so we can usually reach the neighborhood of the minimum of the loss function, although perhaps not land exactly on the minimum point as gradient descent does. And when m is very large, in many cases we may prefer stochastic gradient descent, exchanging a little accuracy for a lot of time.

  In the concrete implementation there is a very important technique: the value of the learning rate η matters a great deal. If η stays fixed throughout stochastic gradient descent, it is quite possible that the search has already come somewhere near the minimum, but because the random directions are noisy and η never shrinks, it slowly drifts back out of that neighborhood. In practice we therefore want the learning rate to decrease gradually: we design a function that makes η smaller and smaller as the number of iterations of the descent grows. You may already have thought of the easiest way, the reciprocal (i_iters denotes the current iteration count):
$$\eta = \frac{1}{i\_iters}$$
  However, this implementation sometimes causes problems: when the iteration count is small, η drops too fast. If i_iters goes from 1 to 2, η drops by 50% at once; if i_iters goes from 10000 to 10001, η drops by only one ten-thousandth. The relative decrease early on and later on differs far too much, so in implementations we usually add a constant b to the denominator; later I will use b = 50:
$$\eta = \frac{1}{i\_iters + b}$$
  As for the numerator, fixing it to 1 sometimes fails to achieve the effect we want, so we let the numerator be a constant a as well:
$$\eta = \frac{a}{i\_iters + b}$$
  We can thus regard a and b as two hyperparameters of the stochastic gradient descent method. In the subsequent learning we will not tune these two hyperparameters; we simply choose empirically good values, a = 5 and b = 50. In any case, for stochastic gradient descent to converge well, the learning rate should decrease gradually as the number of iterations grows.
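  Written as a small Python function (the name learning_rate is illustrative), the schedule is simply:

```python
def learning_rate(i_iters, a=5.0, b=50.0):
    """Learning rate that decays as the iteration count i_iters grows."""
    return a / (i_iters + b)
```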

  This idea of gradual decrease is in fact the idea of simulated annealing. Simulated annealing imitates processes in nature: forging steel requires smelting in fire, and during that process the temperature cools gradually from high to low, the so-called annealing process, with the cooling schedule a function of time t. For this reason the value of η is sometimes also written as:
$$\eta = \frac{t_0}{t + t_1}$$


2. Implementation

  I believe the principle of stochastic gradient descent is now clear; let's look at the concrete code that implements it.

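  Below is a minimal, self-contained sketch of such an implementation, assembled from the formulas above. The simulated data (y = 4x + 3 plus noise), the sample size, and the iteration count are assumptions for illustration:

```python
import numpy as np

# Simulated data: y = 4x + 3 plus Gaussian noise (sizes are illustrative).
m = 100000
x = np.random.normal(size=m)
X = x.reshape(-1, 1)
y = 4.0 * x + 3.0 + np.random.normal(0, 3, size=m)

def dJ_sgd(theta, X_b_i, y_i):
    """Search direction estimated from a single sample (not the true gradient)."""
    return 2.0 * X_b_i.T.dot(X_b_i.dot(theta) - y_i)

def sgd(X_b, y, initial_theta, n_iters):
    """Stochastic gradient descent with the decaying learning rate a/(t + b)."""
    t0, t1 = 5.0, 50.0  # the empirical a and b chosen above

    def learning_rate(t):
        return t0 / (t + t1)

    theta = initial_theta
    for cur_iter in range(n_iters):
        rand_i = np.random.randint(len(X_b))              # pick one sample at random
        gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])  # single-sample direction
        theta = theta - learning_rate(cur_iter) * gradient
    return theta

X_b = np.hstack([np.ones((len(X), 1)), X])  # prepend the intercept column
initial_theta = np.zeros(X_b.shape[1])
theta = sgd(X_b, y, initial_theta, n_iters=len(X_b) // 3)
print(theta)  # typically close to [3, 4]
```

  Even though n_iters here is only a third of the sample count, so that each step looks at a single random sample and many samples are never seen at all, the printed θ usually lands near the true values (intercept 3, slope 4): exactly the accuracy-for-time trade described above.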


  In my next blog, I will encapsulate the training process of the stochastic gradient descent method into our linear regression algorithm, and also use the stochastic gradient descent method provided in sklearn.

  For the specific code, see 25 Stochastic Gradient Descent Method.ipynb

Origin: blog.csdn.net/qq_41033011/article/details/109099363