Linear Regression with Multiple Variables - Gradient Descent for Multiple Variables

Abstract: This article is the transcript of Lecture 29, "Gradient Descent for Multiple Variables", from Chapter 5, "Linear Regression with Multiple Variables", of Andrew Ng's Machine Learning course. I wrote it down while watching the videos and lightly edited it for conciseness and readability, so that it is easy to consult later. I am sharing it here in the hope that it helps others with their study; corrections of any errors are welcome and sincerely appreciated.

In the previous video (article), we talked about the form of the hypothesis for linear regression with multiple features or with multiple variables. In this video (article), let's talk about how to fit the parameters of that hypothesis. In particular, let's talk about how to use gradient descent for linear regression with multiple features.

To quickly summarize our notation, this is the form of the hypothesis in multivariate linear regression, where we've adopted the convention that x_{0}=1. The parameters of this model are \theta_{0} through \theta_{n}, but instead of thinking of these as n+1 separate parameters, which is valid, I'm instead going to think of the parameters as \theta, where \theta here is an (n+1)-dimensional vector. So I'm just going to think of the parameters of this model as themselves being a vector. Our cost function is J(\theta_{0}, \theta_{1}, ..., \theta_{n}), which is given by the usual sum of squared error terms. But again, instead of thinking of J as a function of these n+1 numbers, I'm going to more commonly write J as just a function of the parameter vector \theta, so that \theta here is a vector. Here's what gradient descent looks like. We're going to repeatedly update each parameter \theta_{j} as \theta_{j} minus the learning rate \alpha times the partial derivative of the cost function J(\theta) with respect to the parameter \theta_{j}. Let's see what this looks like when we implement gradient descent, and in particular, let's go see what that partial derivative term looks like.
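Since the slide itself is not reproduced here, the quantities being referred to can be written out explicitly in the course's standard notation (nothing new is added here, this is just the notation spelled out):

```latex
% Hypothesis, with the convention x_0 = 1
h_{\theta}(x) = \theta^{T}x = \theta_{0}x_{0} + \theta_{1}x_{1} + \dots + \theta_{n}x_{n}

% Cost function over the m training examples
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^{2}

% Gradient descent: repeat, simultaneously updating every j = 0, 1, ..., n
\theta_{j} := \theta_{j} - \alpha\,\frac{\partial}{\partial\theta_{j}}J(\theta)
```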

Here's what we have for gradient descent for the case when we had n=1 feature. We had two separate update rules for the parameters \theta_{0} and \theta_{1}, and hopefully this looks familiar to you. This term (\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})) was of course the partial derivative of the cost function with respect to the parameter \theta_{0}, and similarly we had a different update rule for the parameter \theta_{1}. There is one little difference, which is that when we previously had only one feature, we would call that feature x^{(i)}, but now in our new notation we would of course call it x^{(i)}_{1} to denote our one feature. So that was for when we had only one feature.

Let's look at the new algorithm for when we have more than one feature, where the number of features n may be much larger than one. We get this update rule for gradient descent, and for those of you who know calculus, if you take the definition of the cost function and take the partial derivative of the cost function J(\theta) with respect to the parameter \theta_{j}, you'll find that that partial derivative is exactly the term I've drawn the blue box around. And if you implement this, you will get a working implementation of gradient descent for multivariate linear regression.

The last thing I want to do on this slide is give you a sense of why these new and old algorithms are sort of the same thing, or why they're both similar algorithms, or why they're both gradient descent algorithms. Let's consider a case where we have two features, or maybe more than two features, so we have three update rules for the parameters \theta_{0}, \theta_{1}, \theta_{2}, and maybe other values of \theta as well. If you look at the update rule for \theta_{0}, what you find is that this update rule is the same as the update rule that we had previously for the case of n=1. And the reason they are equivalent is, of course, that in our notational convention we have x^{(i)}_{0}=1, which is why the two terms I've drawn the magenta boxes around are equivalent. Similarly, if you look at the update rule for \theta_{1}, you'll find that this term is equivalent to the term we previously had, or the update rule we previously had for \theta_{1}, where of course we're just using the new notation x^{(i)}_{1} to denote the first feature. And now that we have more than one feature, we can have similar update rules for the other parameters like \theta_{2} and so on.

There's a lot going on on this slide, so I definitely encourage you, if you need to, to pause the video and look at all the math on this slide slowly to make sure you understand everything that's going on here. But if you implement the algorithm written up here, then you have a working implementation of linear regression with multiple features.
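As a concrete illustration of the multivariate update rule \theta_{j} := \theta_{j} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_{j}, here is a minimal NumPy sketch of batch gradient descent. The function and variable names are my own and are not part of the course materials; the toy data at the bottom is made up purely for demonstration.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multivariate linear regression.

    X is an (m, n+1) design matrix whose first column is all ones (x_0 = 1),
    y is an (m,) vector of targets, and alpha is the learning rate.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)           # theta_0, ..., theta_n
    for _ in range(num_iters):
        predictions = X @ theta          # h_theta(x^(i)) for every example
        errors = predictions - y         # h_theta(x^(i)) - y^(i)
        gradient = (X.T @ errors) / m    # partial derivative of J w.r.t. each theta_j
        theta = theta - alpha * gradient # simultaneous update of all parameters
    return theta

# Toy example (made up): two features plus the intercept column x_0 = 1.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 1.0],
              [1.0, 0.5, 2.5]])
y = np.array([7.0, 6.0, 4.5])
theta = gradient_descent(X, y, alpha=0.05, num_iters=5000)
```

Updating the whole vector theta in one vectorized step is what gives the simultaneous update of all parameters that gradient descent requires; computing the gradient one component at a time and overwriting theta_j before the others are computed would not be the same algorithm.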

<end>
