[2014 Stanford Machine Learning tutorial notes] Chapter IV - Multivariate Gradient Descent in Practice Ⅱ: Learning Rate

    In this section, we will focus on the learning rate α.

    Recall the gradient descent update rule (reproduced below). In this section, we will learn what debugging means here and pick up some tips for making sure gradient descent is working correctly. We will also learn how to choose the learning rate α.
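    For reference, the update rule is the one from the earlier gradient descent sections, with every parameter θ_j updated simultaneously:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$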

    What gradient descent does is find a value of θ that, we hope, minimizes the cost function J(θ). Therefore, while gradient descent is running, it is common to plot the value of the cost function J(θ).

    [Figure: the cost function J(θ) plotted against the number of iterations of gradient descent]

    Here the x-axis no longer represents the parameter vector θ; instead, it represents the number of iterations of gradient descent. As gradient descent runs, we may get a curve like the one shown above.

    At iteration number 100, the corresponding point on the curve is the value of J(θ) evaluated at the θ obtained after 100 iterations. So this curve shows the value of the cost function after each iteration of gradient descent. If gradient descent is working properly, J(θ) should decrease after every iteration. One use of this curve is that it tells you when the rate of decrease of J(θ) starts to slow down; by observing where the curve flattens out, we can judge whether gradient descent has more or less converged. By the way, the number of iterations gradient descent needs can vary greatly from problem to problem: one problem may converge in only 30 steps, while another may need 30,000. It is therefore difficult to determine in advance how many iterations gradient descent will need to converge.
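    As a rough sketch (not from the original notes, and using made-up data purely for illustration), the following Python code runs gradient descent for linear regression, records J(θ) after every iteration, and plots it against the iteration number so that this curve can be inspected.

```python
import numpy as np
import matplotlib.pyplot as plt

def compute_cost(X, y, theta):
    """Cost function J(theta) for linear regression: (1/2m) * sum of squared errors."""
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)

def gradient_descent(X, y, theta, alpha, num_iters):
    """Run gradient descent, recording J(theta) after every iteration."""
    m = len(y)
    J_history = []
    for _ in range(num_iters):
        gradient = (X.T @ (X @ theta - y)) / m   # partial derivatives of J
        theta = theta - alpha * gradient          # simultaneous update of all theta_j
        J_history.append(compute_cost(X, y, theta))
    return theta, J_history

# Made-up example data: an intercept column of ones plus two features.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 1.0],
              [1.0, 6.0, 2.0],
              [1.0, 8.0, 5.0]])
y = np.array([7.0, 9.0, 14.0, 21.0])
theta, J_history = gradient_descent(X, y, np.zeros(3), alpha=0.01, num_iters=400)

# Plot J(theta) against the number of iterations to check that it keeps decreasing.
plt.plot(range(1, len(J_history) + 1), J_history)
plt.xlabel("Number of iterations")
plt.ylabel("J(theta)")
plt.show()
```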

    Alternatively, you can run an automatic convergence test, that is, have an algorithm tell you whether gradient descent has converged. Below is a classic example of an automatic convergence test.

    If the cost function J(θ) decreases by less than some small value ε in one iteration, the test declares convergence (ε might be, say, 10⁻³). In practice, however, choosing a suitable ε is quite difficult, so to check whether gradient descent has converged it is usually better to look at the graph.
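    A minimal sketch of such a test in Python, reusing compute_cost and the update step from the example above; the threshold epsilon = 1e-3 is just the illustrative value mentioned in the text.

```python
def gradient_descent_with_test(X, y, theta, alpha, max_iters, epsilon=1e-3):
    """Stop as soon as one iteration decreases J(theta) by less than epsilon."""
    m = len(y)
    prev_cost = compute_cost(X, y, theta)
    for i in range(max_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
        cost = compute_cost(X, y, theta)
        if prev_cost - cost < epsilon:      # automatic convergence test
            print(f"Declared convergence after {i + 1} iterations")
            break
        prev_cost = cost
    return theta
```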

    Another advantage of looking at the graph is that it can warn you in advance when the algorithm is not working properly. Specifically, suppose your plot shows the following behaviour:

    J(θ) keeps rising as the iterations go on. This suggests that gradient descent is not working properly, and such a curve usually means you should use a smaller learning rate α. When J(θ) rises, you are typically trying to minimize a function like the one shown below on the left, but because the learning rate α is too large, each update overshoots the minimum, and as the number of iterations increases we move further and further away from it, as shown below on the right.

    [Figure: left, the function being minimized; right, updates overshooting the minimum when α is too large]

    Therefore, the fix in this situation is to use a smaller value of α. Of course, you should also make sure your code does not have a bug.
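    To see the overshooting behaviour concretely, here is a tiny illustrative example (not from the original notes): minimizing J(θ) = θ² with a learning rate that is too large makes |θ|, and hence J(θ), grow at every step.

```python
# Minimize J(theta) = theta^2, whose gradient is 2*theta.
theta = 1.0
alpha = 1.1        # too large: each step overshoots and lands further from the minimum at 0
for i in range(5):
    theta = theta - alpha * 2 * theta
    print(i + 1, theta, theta ** 2)   # J(theta) increases every iteration
```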

    Also, you will sometimes see a curve in which J(θ) repeatedly goes down and then back up again.

    The solution here is, again, to use a smaller value of α.

    Mathematicians have shown that as long as the value of α is sufficiently small, the cost function J(θ) will decrease on every iteration. Therefore, if the cost function is not decreasing, it is most likely because the learning rate α is too large, and we should choose a smaller value.

    However, we also do not want the learning rate α to be too small, because then gradient descent may converge very slowly and need a great many iterations to reach the minimum.


To sum up:

  • If the learning rate α is too small, gradient descent may converge slowly.
  • If the learning rate α is too large, the cost function J(θ) may not decrease on every iteration and may not even converge.
  • We monitor how gradient descent is behaving by plotting J(θ) against the number of iterations.
  • When running gradient descent, we typically try a range of α values, such as 0.001, (0.003), 0.01, (0.03), 0.1, (0.3), 1, and so on. For each value we plot J(θ) against the number of iterations, and then pick the α that makes the cost function J(θ) decrease rapidly (see the sketch after this list).
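
    As an illustrative sketch (not part of the original notes), the following Python code reuses gradient_descent and the made-up data X, y from the earlier example to compare several learning rates and plot the resulting J(θ) curves on one figure.

```python
# Try a range of learning rates and compare how fast J(theta) decreases for each.
for alpha in [0.001, 0.003, 0.01, 0.03]:
    _, J_history = gradient_descent(X, y, np.zeros(3), alpha=alpha, num_iters=200)
    plt.plot(range(1, len(J_history) + 1), J_history, label=f"alpha = {alpha}")

plt.xlabel("Number of iterations")
plt.ylabel("J(theta)")
plt.legend()
plt.show()
```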
