Regularization - Cost function

Abstract: This article is the transcript of Lesson 56, "Cost Function", from Chapter 8, "Regularization", of Andrew Ng's Machine Learning course. I wrote it down while studying the videos and edited it to make it more concise and readable, for later reference. I'm sharing it here; if you find any errors, corrections are welcome and sincerely appreciated. I also hope it helps with your studies.

In this video, I'd like to convey to you the main intuitions behind how regularization works. We'll also write down the cost function that we'll use when we use regularization. With the hand-drawn examples we have on these slides, I think I'll be able to convey part of the intuition. But an even better way to see for yourself how regularization works is to implement it and see it work for yourself. If you do the appropriate exercises after this, you'll get the chance to see regularization in action for yourself. So, here's the intuition.

In the previous video, we saw that if we fit a quadratic function to this data, it gives us a pretty good fit. Whereas if we fit an overly high-order polynomial, we end up with a curve that may fit the training set very well but is really not a good hypothesis: it overfits the data and does not generalize well. Consider the following. Suppose we were to penalize the parameters \theta _{3} and \theta _{4} and make them really small. Here's what I mean. Here is our optimization objective, or here is our optimization problem, where we minimize our usual squared error cost function. Let's say I take this objective and modify it, and add to it 1000\times \theta _{3}^{2} plus 1000\times \theta _{4}^{2}. 1000 is just some huge number I'm writing down. Now, if we were to minimize this function, the only way to make this new cost function small is if \theta _{3} and \theta _{4} are small, right? Because otherwise, if you have 1000\times \theta _{3}^{2}, this new cost function is going to be big. So when we minimize this new function, we are going to end up with \theta _{3} close to 0 and \theta _{4} close to 0. And that's as if we're getting rid of these two terms over there. And if we do that, well then, if \theta _{3} and \theta _{4} are close to 0, we are basically left with a quadratic function. So we end up with a fit to the data that is a quadratic function, plus maybe tiny contributions from the terms involving \theta _{3} and \theta _{4}, which are very close to 0. And so we end up with essentially a quadratic function, which is good, because this is a much better hypothesis. In this particular example, we looked at the effect of penalizing two of the parameter values for being large. More generally, here is the idea behind regularization.
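
Written out in the course's notation (with m training examples and hypothesis h_{\theta }), the modified objective described above is:

\min_{\theta }\ \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta }(x^{(i)})-y^{(i)}\right)^{2}+1000\,\theta _{3}^{2}+1000\,\theta _{4}^{2}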

The idea is that if we have small values for the parameters, then having small values for the parameters will usually correspond to having a simpler hypothesis. So, for our last example, we penalized just \theta _{3} and \theta _{4}, and when both of these were close to 0, we wound up with a much simpler hypothesis that was essentially a quadratic function. But more broadly, if we penalize all the parameters, we can usually think of that as trying to give us a simpler hypothesis as well, because when these parameters are close to 0, in this example, that gives us a quadratic function. More generally, it is possible to show that having smaller values of the parameters usually corresponds to smoother, simpler functions, which are therefore also less prone to overfitting. I realize that the reasoning for why having all the parameters be small corresponds to a simpler hypothesis may not be entirely clear to you right now. And it is kind of hard to explain unless you implement it and see it for yourself. But I hope that the example of having \theta _{3} and \theta _{4} be small, and how that gave us a simpler hypothesis, helps explain why, or at least gives some intuition as to why, this might be true. Let's look at a specific example. For housing price prediction, we may have the 100 features that we talked about, where maybe x_{1} is the size, x_{2} is the number of bedrooms, and x_{3} is the number of floors, and so on. And we may have 100 features. Unlike the polynomial example, we don't know that \theta _{3} and \theta _{4} are the high-order polynomial terms. So, if we have just a bag, just a set of 100 features, it's hard to pick in advance which ones are less likely to be relevant. So we have 101 parameters, and we don't know which ones to pick, which parameters to try to shrink. So, in regularization, what we're going to do is take our cost function, here is my cost function for linear regression, and modify it to shrink all of my parameters, because I don't know which one or two to try to shrink. So I am going to modify my cost function to add a term at the end, like so, where we have square brackets here as well. I've added an extra regularization term at the end to shrink every single parameter, and so this term tends to shrink all of my parameters \theta _{1}, \theta _{2}, \theta _{3}, up through \theta _{100}. By the way, by convention, the summation here starts from 1, so I am not actually going to penalize \theta _{0} for being large. That's the convention: the sum runs from 1 through n, rather than from 0 through n. But in practice it makes very little difference; whether or not you include \theta _{0} makes very little difference to the results. By convention, though, we usually regularize only \theta _{1} through \theta _{100}.
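
For reference, here is the full regularized cost function this paragraph describes, written in the course's notation (m training examples, n features, and \lambda the regularization parameter discussed next):

J(\theta )=\frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_{\theta }(x^{(i)})-y^{(i)}\right)^{2}+\lambda \sum_{j=1}^{n}\theta _{j}^{2}\right]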

Writing down our regularized optimization objective, our regularized cost function, again: here it is. Here is J(\theta ), where this term on the right is the regularization term, and \lambda here is called the regularization parameter. What \lambda does is control a trade-off between two different goals. The first goal, captured by the first term of the objective, is that we would like to fit the training set well. The second goal is that we want to keep the parameters small, and that's captured by the second term, the regularization term. And what \lambda, the regularization parameter, does is control the trade-off between these two goals: between the goal of fitting the training set well, and the goal of keeping the parameters small and therefore keeping the hypothesis relatively simple, to avoid overfitting. For our housing price prediction example, whereas previously, if we had fit a very high-order polynomial, we might have wound up with a very wiggly or curvy function, if you still fit a high-order polynomial with all the polynomial features in there, but make sure to use this sort of regularized objective, then what you get out is in fact a curve that isn't quite a quadratic function, but is much smoother and much simpler. Maybe a curve like the magenta line, which gives a much better hypothesis for this data. Once again, I realize it can be a bit difficult to see why shrinking the parameters can have this effect, but if you implement this algorithm yourself with regularization, you will be able to see this effect firsthand.
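
As a concrete illustration, here is a minimal NumPy sketch of this regularized cost function. The variable names (X, y, theta, lam) and the tiny random dataset are my own, not from the lecture; the formula is the J(\theta ) written above, with \theta _{0} left unpenalized.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear regression cost J(theta).

    Assumes X already includes a leading column of ones, so theta[0]
    plays the role of theta_0 and is not penalized.
    """
    m = len(y)
    errors = X @ theta - y                   # h_theta(x^(i)) - y^(i) for every example
    fit_term = np.sum(errors ** 2)           # squared-error part: fit the training set
    reg_term = lam * np.sum(theta[1:] ** 2)  # penalize theta_1 .. theta_n only
    return (fit_term + reg_term) / (2 * m)

# Tiny made-up example: 5 training examples, 2 features plus the intercept column.
rng = np.random.default_rng(0)
X = np.c_[np.ones(5), rng.normal(size=(5, 2))]
y = rng.normal(size=5)
print(regularized_cost(np.zeros(3), X, y, lam=1.0))
```

Increasing lam makes large parameter values more expensive, which is exactly the trade-off between fitting the training set and keeping the parameters small described above.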

In regularized linear regression, if the regularization parameter \lambda is set to be very large, then what will happen is that we end up penalizing the parameters \theta _{1}, \theta _{2}, \theta _{3}, \theta _{4} very highly. That is, if our hypothesis is the one at the bottom, and we penalize \theta _{1}, \theta _{2}, \theta _{3}, \theta _{4} very heavily, then we end up with all of these parameters close to 0, right? And if we do that, it's as if we're getting rid of these terms in the hypothesis, so we're just left with a hypothesis that says housing prices are equal to \theta _{0}. That is akin to fitting a flat horizontal straight line to the data, and this is an example of underfitting. In particular, this hypothesis, this straight line, just fails to fit the training set well. It's just a flat straight line; it doesn't go anywhere near most of the training examples. Another way of saying this is that this hypothesis has too strong a preconception, or too high a bias, that housing prices are just equal to \theta _{0}, and despite the clear data to the contrary, it chooses to fit just a flat horizontal line to the data. So for regularization to work well, some care should be taken to choose a good value for the regularization parameter \lambda as well. And when we talk about model selection later in this course, we'll talk about a variety of ways of automatically choosing the regularization parameter \lambda. So that's the idea of regularization, and the cost function we use in order to apply regularization. In the next two videos, let's take these ideas and apply them to linear regression and to logistic regression, so that we can get them to avoid overfitting problems.
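
To see this underfitting effect numerically, one option (my own sketch, not from the lecture) is to minimize the regularized cost for a small and a very large \lambda on some made-up one-feature housing data and compare the learned parameters: with a huge \lambda, \theta _{1} is driven toward 0 and the hypothesis collapses to roughly the flat line y=\theta _{0}.

```python
import numpy as np
from scipy.optimize import minimize

def regularized_cost(theta, X, y, lam):
    # Same J(theta) as before: squared error plus lam * sum of theta_1..theta_n squared.
    m = len(y)
    errors = X @ theta - y
    return (np.sum(errors ** 2) + lam * np.sum(theta[1:] ** 2)) / (2 * m)

# Made-up data: price grows roughly linearly with size (true intercept 3, slope 2).
rng = np.random.default_rng(1)
size = np.linspace(0, 10, 30)
price = 3.0 + 2.0 * size + rng.normal(scale=1.0, size=size.shape)
X = np.c_[np.ones_like(size), size]          # design matrix [1, x]

for lam in (0.0, 1e5):
    result = minimize(regularized_cost, x0=np.zeros(2), args=(X, price, lam))
    theta0, theta1 = result.x
    print(f"lambda={lam:>9.1f}: theta_0={theta0:7.3f}, theta_1={theta1:7.3f}")
# With lambda = 0 the slope comes out near 2; with lambda = 1e5 the slope is pushed
# toward 0 and the fit is close to the flat horizontal line y = theta_0 (underfitting).
```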

<end>
