Linear Regression with multiple variables - Gradient descent in practice II: Learning rate

Abstract: This article is the transcript of Lecture 31, "Gradient Descent in Practice II: Learning Rate", from Chapter 5, "Linear Regression with Multiple Variables", of Andrew Ng's Machine Learning course. I recorded the subtitles while studying the videos and lightly edited them to be more concise and readable for future reference, and I am sharing them here. If you find any mistakes, corrections are welcome, and I sincerely thank you in advance. I hope this is helpful for your studies as well.
In this video (article), I wanna give you more practical tips for getting gradient descent to work. The ideas in this video (article) will center around the learning rate \alpha.

Concretely, here is the gradient descent update rule. What I want to do in this video is, first, tell you about what I think of as debugging: some tips for making sure that gradient descent is working correctly. And second, I want to tell you how to choose the learning rate \alpha, or at least how I go about choosing it.
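The slide itself isn't reproduced here, but the rule being referred to is \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta). A minimal vectorized sketch for linear regression (the function name is mine, not from the course):

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha):
    """One update: theta := theta - alpha * (1/m) * X^T (X @ theta - y),
    the gradient of the squared-error cost J(theta)."""
    m = len(y)
    gradient = X.T @ (X @ theta - y) / m
    return theta - alpha * gradient
```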

Here's something that I often do to make sure gradient descent is working correctly. The job of gradient descent is to find a value of \theta for you that hopefully minimizes the cost function J(\theta). What I often do is therefore plot the cost function J(\theta) as gradient descent runs. So, the x-axis here is the number of iterations of gradient descent, and as gradient descent runs, you'll hopefully get a plot that maybe looks like this. Notice that the x-axis is the number of iterations. Previously we were looking at plots of J(\theta) where the x-axis was the parameter vector \theta, but that is not what this is. Concretely, what this point means is: I'm going to run gradient descent for a hundred iterations, and whatever value I get for \theta after a hundred iterations, I'm going to evaluate the cost function J(\theta) at that value of \theta, and this vertical height is the value of J(\theta) I got. And this point here corresponds to the value of J(\theta) for the \theta that I get after I've run gradient descent for two hundred iterations. So what this plot is showing is the value of your cost function after each iteration of gradient descent. And, if gradient descent is working properly, then J(\theta) should decrease after every iteration. One useful thing that this sort of plot tells you is that, if you look at the specific figure that I've drawn, it looks like by the time you've gotten out to three hundred iterations (between three hundred and four hundred iterations, in this segment), J(\theta) hasn't gone down much more. So by the time you get to four hundred iterations, it looks like this curve has flattened out. And so, here at four hundred iterations, it looks like gradient descent has more or less converged, because your cost function isn't going down much more. So looking at this figure can also help you judge whether or not gradient descent has converged.
By the way, the number of iterations that gradient descent takes to converge for a particular application can vary a lot. So maybe for one application, gradient descent may converge after just thirty iterations; for a different application, gradient descent may take 3,000 iterations. For another learning algorithm, it may take 3,000,000 iterations. It turns out to be very difficult to tell in advance how many iterations gradient descent needs to converge, and it's usually by making this sort of plot, plotting the cost function as we increase the number of iterations, and looking at it that I try to tell if gradient descent has converged. It's also possible to come up with an automatic convergence test, namely an algorithm that tries to tell you if gradient descent has converged. Here's a pretty typical example of an automatic convergence test: declare convergence if your cost function J(\theta) decreases by less than some small value \varepsilon (epsilon), say 10^{-3}, in one iteration. But I find that choosing this threshold is usually pretty difficult. So, in order to check whether gradient descent has converged, I actually tend to look at plots like the figure on the left rather than rely on an automatic convergence test. Looking at this sort of figure can also give you an advance warning that gradient descent may not be working correctly.
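A sketch of how one might record J(\theta) after each iteration and apply such an automatic test; the 10^{-3} threshold follows the lecture, while the function name and defaults are my own assumptions:

```python
import numpy as np

def run_gradient_descent(X, y, alpha, max_iters=400, epsilon=1e-3):
    """Run gradient descent on linear regression, recording J(theta)
    after every iteration. Stops early when J decreases by less than
    epsilon in one iteration (the automatic convergence test)."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    cost = lambda t: float(((X @ t - y) ** 2).sum() / (2 * m))
    history = [cost(theta)]
    for _ in range(max_iters):
        theta = theta - alpha * X.T @ (X @ theta - y) / m
        history.append(cost(theta))
        if history[-2] - history[-1] < epsilon:  # automatic convergence test
            break
    return theta, history
```

In practice you would plot `history` against the iteration number (e.g. with matplotlib) and eyeball the curve, as the lecture recommends, rather than trust the threshold alone.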

Concretely, if you plot J(\theta) as a function of the number of iterations and you see a figure like this, where J(\theta) is actually increasing, that gives you a clear sign that gradient descent is not working. A figure like this usually means that you should be using a smaller learning rate \alpha. If J(\theta) is actually increasing, the most common cause is that you're trying to minimize a function that maybe looks like this, and your learning rate is too big: if you start off there, gradient descent may overshoot the minimum and send you there; then, if the learning rate is too big, you may overshoot again and be sent there, and so on. What you really wanted was to start here and slowly go downhill. But if the learning rate is too big, gradient descent can instead keep overshooting the minimum, so that you actually end up worse and worse, getting higher and higher values of the cost function J(\theta). So you end up with a plot like this. If you see a plot like this, the fix usually is just to use a smaller value of \alpha. Oh, and also, of course, make sure that your code doesn't have a bug in it. But usually too large a value of \alpha is a common cause. Similarly, sometimes you may also see J(\theta) do something like this: go down for a while, then go up, then go down for a while, then go up. The fix for something like this is also to use a smaller value of \alpha. I'm not going to prove it here, but under certain assumptions about the cost function J (assumptions that do hold true for linear regression), mathematicians have shown that if your learning rate is small enough, then J(\theta) should decrease on every single iteration. So, if this doesn't happen, it probably means \alpha is too big, and you should use a smaller value. But of course, you also don't want your learning rate to be too small, because if you do that, then gradient descent can be slow to converge.
And if \alpha were too small, you might end up starting out here, say, and, end up taking just minuscule, minuscule baby steps. And just taking a lot of iterations before you finally get to the minimum. And so, if \alpha is too small, gradient descent can make very slow progress and be slow to converge.
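To illustrate both failure modes on a toy quadratic of my own (not from the course), J(\theta) = \theta^2 with gradient 2\theta:

```python
def descend(theta, alpha, iters):
    """Gradient descent on the toy cost J(theta) = theta**2,
    whose gradient is 2 * theta."""
    for _ in range(iters):
        theta -= alpha * 2 * theta
    return theta

# Starting from theta = 1.0 (minimum is at theta = 0):
#   alpha = 1.1   -> each step overshoots; |theta| grows, J increases
#   alpha = 0.001 -> minuscule baby steps; barely moves in 100 iterations
#   alpha = 0.3   -> converges quickly toward 0
```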

To summarize, if the learning rate is too small, you can have a slow convergence problem, and if the learning rate is too large, J(\theta) may not decrease on every iteration and may not even converge. In some cases, if the learning rate is too large, slow convergence is also possible, but the more common problem you see is that J(\theta) may not decrease on every iteration. And in order to debug all of these things, plotting J(\theta) as a function of the number of iterations can often help you figure out what's going on. Concretely, what I actually do when I run gradient descent is try a range of values. So just try running gradient descent with a range of values for \alpha, like 0.001, 0.01, ..., so these are a factor of ten apart, and for these different values of \alpha, just plot J(\theta) as a function of the number of iterations, and then pick the value of \alpha that seems to cause J(\theta) to decrease rapidly. In fact, what I do isn't actually these steps of ten, a scale factor of ten at each step. What I'll actually do is try this range of values and so on, where this is, you know, 0.001, then increase the learning rate roughly threefold to get 0.003, and then another roughly threefold step from 0.003 to 0.01, so that each value of \alpha I try with gradient descent is about 3x bigger than the previous one. So what I do is try a range of values until I've made sure that I've found one value that is too small and one value that is too large, and then I pick the largest possible value, or just something slightly smaller than the largest reasonable value that I found. When I do that, it usually gives me a good learning rate for my problem. And if you do this too, hopefully you will be able to choose a good learning rate for your implementation of gradient descent.
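The roughly-3x sweep described above can be sketched as follows; the function name and the candidate list are my own, and in practice you would also plot each run's cost history rather than compare only the final costs:

```python
import numpy as np

def sweep_learning_rates(X, y,
                         alphas=(0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0),
                         iters=100):
    """Run gradient descent once per candidate alpha (each ~3x the last)
    and return {alpha: final J(theta)}. Divergent runs show up as
    very large final costs."""
    m = len(y)
    cost = lambda t: float(((X @ t - y) ** 2).sum() / (2 * m))
    results = {}
    for alpha in alphas:
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            theta = theta - alpha * X.T @ (X @ theta - y) / m
        results[alpha] = cost(theta)
    return results
```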

<end>



Reposted from blog.csdn.net/edward_wang1/article/details/103757798