Linear Regression with multiple variables - Gradient descent in practice I: Feature Scaling

Abstract: This article is a transcript of Lecture 30, "Gradient Descent in Practice I: Feature Scaling," from Chapter 5, "Linear Regression with Multiple Variables," of Andrew Ng's Machine Learning course. I wrote it down while watching the videos and lightly edited it to make it more concise and easier to read, so that it can be reviewed later. I'm sharing it here in the hope that it helps others with their studies; if you find any errors, corrections are welcome and sincerely appreciated.

In this video (article) and the one after it, I want to tell you about some practical tricks for making gradient descent work well. In this video (article), I want to tell you about an idea called feature scaling. Here is the idea.

If you have a problem with multiple features, and you make sure that those features are on a similar scale, by which I mean that the different features take on similar ranges of values, then gradient descent can converge more quickly. Concretely, let's say you have a problem with two features, where x_{1} is the size of the house and takes on values between, say, 0 and 2000, and x_{2} is the number of bedrooms, which maybe takes on values between 1 and 5. If you plot the contours of the cost function J(\theta ), then the contours may look like this, where J(\theta ) is a function of the parameters \theta _{0}, \theta _{1} and \theta _{2}. I'm going to ignore \theta _{0}, so let's forget about \theta _{0} and pretend J is a function of only \theta _{1} and \theta _{2}. If x_{1} can take on a much larger range of values than x_{2}, it turns out that the contours of the cost function J(\theta ) can take on this very, very skewed elliptical shape; in fact, with the 2000-to-5 ratio, it can be even more skewed than drawn here. So these very tall, skinny ellipses, or very tall, skinny ovals, can form the contours of the cost function J(\theta ). And if you run gradient descent on this sort of cost function, it may end up oscillating back and forth and taking a long time before it finally finds its way to the global minimum. In fact, you can imagine that if these contours are exaggerated even more, drawn incredibly tall and skinny (and they can be even more extreme than that), then gradient descent will just have a much harder time, meandering around and taking a long time to find its way to the global minimum.
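(A rough way to see why the contours get so elongated, as a back-of-the-envelope argument that is my own addition and not part of the lecture: ignoring \theta _{0} and the cross terms, the curvature of J(\theta ) in the \theta _{1} direction is about \frac{1}{m}\sum x_{1}^{2}, while in the \theta _{2} direction it is about \frac{1}{m}\sum x_{2}^{2}. The lengths of the contour axes go like one over the square root of the curvature, so the axes differ by roughly \sqrt{\frac{\sum x_{1}^{2}}{\sum x_{2}^{2}}}, which for values around 2000 versus values around 5 is on the order of \frac{2000}{5}=400. That is why the ellipses come out so tall and skinny.)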

In these settings, a useful thing to do is to scale the features. Concretely, if you instead define the feature x_{1} to be the size of the house divided by 2000, and define x_{2} to be maybe the number of bedrooms divided by 5, then the contours of the cost function J(\theta ) become much less skewed, so the contours may look more like circles. And if you run gradient descent on a cost function like this, then gradient descent, you can show mathematically, can find a much more direct path to the global minimum, rather than taking a much more convoluted path and following a much more complicated trajectory to get there. So, by scaling the features so that they take on similar ranges of values (in this example, both x_{1} and x_{2} end up between 0 and 1), you can wind up with an implementation of gradient descent that converges much faster.
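To make this concrete, here is a minimal Python sketch, not from the lecture, that runs batch gradient descent on the house example with and without dividing by 2000 and 5. The synthetic data, the `gradient_descent` helper, and the learning rates and iteration counts are all illustrative assumptions; the point is only that the unscaled version needs a tiny learning rate and many more iterations to behave.

```python
# Minimal sketch (assumptions: synthetic data, illustrative learning rates and iteration counts).
import numpy as np

rng = np.random.default_rng(0)
m = 100
size = rng.uniform(0, 2000, m)          # x1: house size, roughly 0 to 2000
bedrooms = rng.integers(1, 6, m)        # x2: number of bedrooms, 1 to 5
y = 0.1 * size + 10 * bedrooms + rng.normal(0.0, 5.0, m)   # made-up prices

def gradient_descent(X, y, alpha, iters):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= (alpha / len(y)) * X.T @ (X @ theta - y)   # simultaneous update of all theta_j
    return theta

# Unscaled: x1 in [0, 2000], x2 in [1, 5]; alpha must be tiny or J(theta) blows up.
X_raw = np.column_stack([np.ones(m), size, bedrooms])
theta_raw = gradient_descent(X_raw, y, alpha=1e-7, iters=50000)

# Scaled: both features roughly in [0, 1]; a much larger alpha converges far faster.
X_scaled = np.column_stack([np.ones(m), size / 2000.0, bedrooms / 5.0])
theta_scaled = gradient_descent(X_scaled, y, alpha=0.1, iters=1000)
```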

More generally, when we're performing feature scaling, what we often want to do is get every feature into approximately a -1 to +1 range. Concretely, your feature x_{0} is always equal to 1, so that's already in that range, but you may end up dividing other features by different numbers to get them into this range. The numbers -1 and +1 aren't too important. So, if you have a feature x_{1} that winds up being between 0 and 3, say, that's not a problem. If you end up having a different feature that winds up being between -2 and +0.5, again, this is close enough to -1 and +1, so that's fine. It's only if you have a different feature, x_{3} say, that ranges from -100 to +100, that the values are very different from -1 and +1, so this might be a less well-scaled feature. And similarly, if your features take on a very, very small range of values, say x_{4} takes on values between -0.0001 and +0.0001, then again this is a much smaller range of values than the -1 to +1 range, and again I would consider this feature poorly scaled. So the range of values can extend a bit above +1 or below -1, just not much bigger, like the +100 here, or much smaller, like the 0.0001 over there. Different people have different rules of thumb, but the one that I use is that if a feature takes on values from, say, -3 to +3, I usually think that's just fine; if it takes on values much larger than +3 or smaller than -3, I start to worry. And if it takes on values from, say, -1/3 to +1/3, I think that's fine too, or 0 to 1/3, or -1/3 to 0; those are typical ranges of values I consider okay. But if it takes on a much tinier range of values, like x_{4} here, then again you start to worry. So, the take-home message is: don't worry if your features are not exactly on the same scale or exactly in the same range of values. As long as they're all close enough to this, gradient descent should work okay.
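As an aside (this is my own rule-of-thumb check, not something from the lecture), it is easy to scan a design matrix for features that fall outside these rough limits. The `check_feature_scaling` helper and its thresholds below are assumptions that simply encode the -3 to +3 and 1/3 heuristics mentioned above.

```python
import numpy as np

def check_feature_scaling(X, lo=-3.0, hi=3.0, min_range=1.0 / 3.0):
    """Flag columns of X (without the x0 = 1 column) that look poorly scaled."""
    for j in range(X.shape[1]):
        col_min, col_max = X[:, j].min(), X[:, j].max()
        if col_min < lo or col_max > hi:
            print(f"feature {j}: values in [{col_min:.4g}, {col_max:.4g}] look too large")
        elif (col_max - col_min) < min_range:
            print(f"feature {j}: range {(col_max - col_min):.4g} looks too small")
```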

In addition to dividing by the maximum value when performing feature scaling, sometimes people will also do what's called mean normalization. What I mean by that is that you take a feature x_{i} and replace it with x_{i}-\mu _{i}, to make your features have approximately zero mean. Obviously we won't apply this to the feature x_{0}, because x_{0} is always equal to 1 and so cannot have an average value of 0. But concretely, for the other features, if the size of the house takes on values between 0 and 2000 and the average size of a house is 1000, then you might use this formula and set the feature x_{1}=\frac{size-1000}{2000}. And similarly, if your houses have one to five bedrooms and on average a house has two bedrooms, then you might use this formula to mean-normalize your second feature: x_{2}=\frac{\#bedrooms-2}{5}. In both of these cases, you end up with features x_{1} and x_{2} that take on values roughly between -0.5 and +0.5. That's not exactly true, since x_{2} can actually be slightly larger than +0.5, but it's close enough.
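As a quick worked example (the particular house is my own made-up number, but the formulas are the ones above): a 1700-square-foot house with 3 bedrooms would get the features x_{1}=\frac{1700-1000}{2000}=0.35 and x_{2}=\frac{3-2}{5}=0.2, both comfortably within that roughly -0.5 to +0.5 range.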

And the more general rule is that you might take a feature x_{1} and replace it with \frac{x_{1}-\mu _{1}}{s_{1}}, where:

- \mu _{1} is the average value of x_{1} in the training set,

- and s_{1} is the range of values of that feature, and by range I mean the maximum value minus the minimum value. Or, for those of you who know what the standard deviation of a variable is, setting s_{1} to be the standard deviation of the variable would be fine, too; but simply taking the max minus the min would also be fine.

And similarly for the second feature x_{2}, you can replace x_{2} with \frac{x_{2}-\mu _{2}}{s_{2}}. This sort of formula will get your features, maybe not exactly, but roughly into these sorts of ranges. By the way, for those of you who are being super careful, technically if we take the range as max minus min, the 5 in the formula above would actually become a 4: if the max is 5 and the min is 1, then the range of values is actually 4. But all of these are approximate, and any value that gets the features into anything close to these sorts of ranges will do fine. The feature scaling doesn't have to be too exact in order to get gradient descent to run quite a lot faster.
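Here is a minimal Python sketch of this general rule, assuming a NumPy design matrix whose first column is the x_{0}=1 column; the `mean_normalize` helper is a hypothetical name, and s can be taken either as the range (max minus min) or as the standard deviation, as discussed above.

```python
import numpy as np

def mean_normalize(X, use_std=False):
    """Return a copy of X with every column except x0 replaced by (x - mu) / s."""
    X = np.asarray(X, dtype=float).copy()
    features = X[:, 1:]                      # leave the x0 = 1 column untouched
    mu = features.mean(axis=0)
    if use_std:
        s = features.std(axis=0)             # s as the standard deviation
    else:
        s = features.max(axis=0) - features.min(axis=0)   # s as max minus min
    X[:, 1:] = (features - mu) / s
    return X, mu, s   # keep mu and s so later examples can be scaled the same way
```

The returned mu and s matter in practice: any later example you feed the hypothesis, including the one you want to predict on, should be normalized with the same mu and s computed from the training set.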

So, now you know about feature scaling. If you apply this simple trick, it can make gradient descent run much faster and converge in a lot fewer iterations. In the next video (article), I will tell you about another trick to make gradient descent work well in practice.

<end>

