Regularization - The problem of overfitting

Abstract: This article is the transcript of Lecture 55, "The Problem of Overfitting", from Chapter 8, "Regularization", of Andrew Ng's Machine Learning course. I wrote it down while studying the videos and lightly edited it to make it more concise and easier to read, for my own future reference, and I am sharing it here. If there are any mistakes, corrections are very welcome and sincerely appreciated. I also hope it is helpful for your studies.
 

By now, you've seen a couple different learning algorithms, linear regression and logistic regression. They work well for many problems, but when you apply them to certain machine learning applications, they can run into a problem called overfitting, that can cause them to perform very poorly. What I'd like to do in this video is explain to you what is this overfitting problem, and in the next few videos after this, we'll talk about a technique called regularization, that will allow us to ameliorate or to reduce the overfitting problem, and get these learning algorithms to maybe work much better. So what is overfitting?

Let's keep using our running example of predicting housing prices with linear regression, where we want to predict the price as a function of the size of the house. One thing we could do is fit a linear function to this data, and if we do that, maybe we get that sort of straight line fit to the data. But this isn't a very good model. Looking at the data, it seems pretty clear that as the size of the house increases, the housing prices plateau, or kind of flatten out, as we move to the right, and so this algorithm does not fit the training set very well. We call this problem underfitting. Another term for this is that this algorithm has high bias. Both of these roughly mean that it's just not even fitting the training data very well. The term bias is kind of a historical or technical one, but the idea is that if we fit a straight line to the data, then it's as if the algorithm has a very strong preconception, or a very strong bias, that housing prices are going to vary linearly with their size. And despite the data to the contrary, despite the evidence to the contrary, its preconception or bias still causes it to fit a straight line, and this ends up being a poor fit to the data. Now, in the middle, we could fit a quadratic function to the data, and with this data set, if we fit the quadratic function, maybe we get that kind of curve, and that works pretty well. And, at the other extreme, would be if we were to fit, say, a fourth-order polynomial to the data, so we have five parameters, \theta_{0} through \theta_{4}. And with that, we can actually fit a curve that passes through all five of our training examples. You might get a curve that looks like this, that, on the one hand, seems to do a very good job fitting the training set, in that it passes through all of my data, at least. But this is still a very wiggly curve, right? It's going up and down all over the place, and we don't actually think that's such a good model for predicting housing prices. So, this problem, we call overfitting. And another term for this is that this algorithm has high variance. The term high variance is another historical or technical one, but the intuition is that if we're fitting such a high-order polynomial, then the hypothesis can fit, you know, almost any function. And this space of possible hypotheses is just too large, it's too variable, and we don't have enough data to constrain it to give us a good hypothesis. So that's called overfitting. And in the middle, there isn't really a name, but I'm just going to write, you know, "just right", where a second-degree polynomial, a quadratic function, seems to be just right for fitting this data.

To recap a bit, the problem of overfitting comes when we have too many features; the learned hypothesis may then fit the training set very well, so your cost function may actually be very close to 0, but you may end up with a curve like this that tries so hard to fit the training set that it fails to generalize to new examples, and fails to predict prices on new examples as well. And here the term generalize refers to how well a hypothesis applies even to new examples, that is, to data on houses that it has not seen in the training set. On this slide, we looked at overfitting for the case of linear regression. A similar thing can apply to logistic regression as well.
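The following is a minimal sketch, not from the lecture, of the underfit / just-right / overfit contrast above. It uses numpy and a tiny made-up housing data set (sizes and prices are illustrative values only), fitting polynomials of degree 1, 2, and 4 and printing the training cost.

```python
# Minimal sketch (not from the lecture): polynomials of different degrees
# fit to a tiny, made-up housing data set, to contrast underfitting,
# a reasonable fit, and overfitting.
import numpy as np

# Hypothetical training data: house size (1000s of sq ft) vs. price (1000s of $).
size = np.array([1.0, 1.5, 2.0, 3.0, 4.5])
price = np.array([200.0, 280.0, 330.0, 390.0, 410.0])

for degree in (1, 2, 4):
    # np.polyfit returns the polynomial coefficients minimizing squared training error.
    coeffs = np.polyfit(size, price, degree)
    predictions = np.polyval(coeffs, size)
    train_cost = np.mean((predictions - price) ** 2) / 2
    print(f"degree {degree}: training cost = {train_cost:.4f}")

# Degree 1 underfits (high bias). Degree 4 passes through all five points,
# so its training cost is essentially 0, but the curve is wiggly and unlikely
# to generalize (high variance). Degree 2 is "just right".
```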

Here is a logistic regression example with two features x_{1} and x_{2}. One thing we could do is fit logistic regression with just a simple hypothesis like this, where, as usual, g is my sigmoid function. And if you do that, you end up with a hypothesis that tries to use, maybe, just a straight line to separate the positive and negative examples. And this doesn't look like a very good fit to the data. So, once again, this is an example of underfitting, or of the hypothesis having high bias. In contrast, if you were to add to your features these quadratic terms, then you could get a decision boundary that might look more like this. And, you know, that's a pretty good fit to the data. Probably about as good as we could get on this training set. And finally, at the other extreme, if you were to fit a very high-order polynomial, if you were to generate lots of high-order polynomial terms as features, then logistic regression may contort itself, may try really hard to find a decision boundary that fits your training data, or go to great lengths to contort itself to fit every single training example well. And, you know, if the features x_{1} and x_{2} are for predicting, say, whether a breast tumor is malignant or benign, this really doesn't look like a very good hypothesis for making predictions. And so, once again, this is an instance of overfitting, and of a hypothesis having high variance and being unlikely to generalize well to new examples.
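As a rough illustration of the same idea for classification, here is a sketch, again not from the lecture, that uses scikit-learn to expand two features x_{1} and x_{2} into polynomial terms of increasing degree before fitting logistic regression. The synthetic data, the degree choices, and the very weak penalty (large C) are assumptions made only so the effect is visible.

```python
# Minimal sketch (not from the lecture): logistic regression with polynomial
# feature terms of increasing degree on a synthetic two-feature data set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))                           # two features x1, x2
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)    # roughly circular boundary

for degree in (1, 2, 6):
    # Expand [x1, x2] into all polynomial terms up to the given degree,
    # then fit an essentially unregularized logistic regression (large C).
    X_poly = PolynomialFeatures(degree).fit_transform(X)
    clf = LogisticRegression(C=1e6, max_iter=10000).fit(X_poly, y)
    print(f"degree {degree}: training accuracy = {clf.score(X_poly, y):.2f}")

# Degree 1 forces a straight-line decision boundary (underfitting here),
# degree 2 can capture the circular boundary, and a high degree lets the
# boundary contort itself around individual training examples.
```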

Later in this course, when we talk about debugging and diagnosing things that can go wrong with learning algorithms, we'll give you specific tools to recognize when overfitting, and also when underfitting, may be occurring. But for now, let's talk about the problem of, if we think overfitting is occurring, what we can do to address it. In the previous examples, we had one- or two-dimensional data, so we could just plot the hypothesis and see what was going on, and select the appropriate degree of polynomial. So, earlier for the housing prices example, we could just plot the hypothesis and, you know, maybe see that it was fitting the sort of very wiggly function that goes all over the place to predict housing prices. And we could then use figures like these to select an appropriate degree of polynomial. So plotting the hypothesis could be one way to try to decide what degree polynomial to use. But that doesn't always work. And in fact, more often, we may have learning problems where we just have a lot of features. And then it's not just a matter of selecting what degree polynomial to use. And, in fact, when we have so many features, it also becomes much harder to plot the data, and it becomes much harder to visualize it to decide what features to keep or not. So concretely, if we're trying to predict housing prices, sometimes we can just have a lot of different features. And all of these features seem, you know, maybe they seem kind of useful. But, if we have a lot of features and very little training data, then overfitting can become a problem.

In order to address overfitting, there are two main options for things that we can do. The first option is to try to reduce the number of features. Concretely, one thing we could do is manually look through the list of features and use that to try to decide which are the more important features, and therefore which are the features we should keep and which are the features we should throw out. Later in this course, we'll also talk about model selection algorithms, which are algorithms for automatically deciding which features to keep and which features to throw out. This idea of reducing the number of features can work well and can reduce overfitting, and when we talk about model selection, we'll go into this in much greater depth. But the disadvantage is that, by throwing away some of the features, you are also throwing away some of the information you have about the problem. For example, maybe all of those features are actually useful for predicting the price of a house, so maybe we don't actually want to throw some of our information, or some of our features, away. The second option, which we'll talk about in the next few videos, is regularization. Here, we're going to keep all the features, but we're going to reduce the magnitude, or the values, of the parameters \theta_{j}. And this method works well, we'll see, when we have a lot of features, each of which contributes a little bit to predicting the value of y, like what we saw in the housing price example, where we could have a lot of features, each of which is, you know, somewhat useful, so maybe we don't want to throw them away.
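To give a feel for that second option before the next videos formalize it, here is a sketch, not from the lecture, using scikit-learn's Ridge model, which applies an L2 penalty in the same spirit as the regularized cost derived later (the exact cost function and the lambda values here are assumptions for illustration). It reuses the fourth-order polynomial features from the housing example and shows the parameter magnitudes shrinking as the penalty grows, while every feature is kept.

```python
# Minimal sketch (not from the lecture): keeping all the polynomial features
# but shrinking the parameters theta_j with an L2 (ridge-style) penalty.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Same made-up housing data as the earlier sketch.
size = np.array([[1.0], [1.5], [2.0], [3.0], [4.5]])   # 1000s of sq ft
price = np.array([200.0, 280.0, 330.0, 390.0, 410.0])  # 1000s of $

# Fourth-order polynomial features, as in the overfitting example above.
X_poly = PolynomialFeatures(degree=4, include_bias=False).fit_transform(size)

for lam in (0.001, 1.0, 100.0):
    # Ridge's alpha plays the role of the regularization parameter lambda.
    model = Ridge(alpha=lam).fit(X_poly, price)
    print(f"lambda = {lam:>7}: ||theta|| = {np.linalg.norm(model.coef_):.3f}")

# A tiny lambda leaves the wiggly, nearly unregularized fit; larger lambda
# shrinks the magnitudes of the theta_j and smooths the hypothesis while
# still keeping every feature.
```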

So, that describes the idea of regularization at a very high level. And I realize that all of these details probably don't make sense to you yet. But in the next video, we'll start to formulate exactly how to apply regularization and exactly what regularization means. And then we'll start to figure out how to use this to make our learning algorithms work well and avoid overfitting.

<end>



Reposted from blog.csdn.net/edward_wang1/article/details/105221308