Getting Started with Machine Learning: Single-Variable Linear Regression (Part 1) - Gradient Descent

 

In statistics, linear regression (English: Linear Regression) is a regression analysis that models the relationship between one or more independent variables and a dependent variable using a least-squares linear regression equation. The function is a linear combination of one or more model parameters, called regression coefficients. The case with a single independent variable is called simple regression; with more than one independent variable it is called multiple regression (multivariate linear regression). ------Wikipedia

  This topic has long been the stepping stone into machine learning; Professor Andrew Ng also uses it as the first example in his course, and this article draws heavily on his course material.

  Here I will simply call the independent variable x and the dependent variable y. In univariate linear regression, x is a one-dimensional continuous value.

  Univariate linear regression means: given a data set, find the equation that best fits it, i.e. draw the line that best matches the original data (the data is usually required to be continuous).

  In this article we will use gradient descent to fit a linear approximation to the original data, in other words, to find the best-fitting linear equation.

 

  Suppose the linear equation is given as follows:

    $h_\theta(x) = \theta_0 + \theta_1 x$

    (i.e. θ0 is the constant term, or in other words a constant offset)

  Before learning the gradient descent method, we also need some prerequisite knowledge, including but not limited to:

  • Normalization (data preprocessing)

  Normalization formula:

    $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$

  Normalization maps the data into the interval [0, 1] as a preprocessing step; in this example it mainly serves to speed up the convergence of gradient descent.
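  As a quick illustration, here is a minimal Python sketch of this min-max normalization (the function name and sample values are my own, not from the original post):

    import numpy as np

    def min_max_normalize(x):
        """Map a 1-D array into [0, 1] via (x - min) / (max - min)."""
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min())

    # Example: years-of-experience values end up spread across [0, 1].
    years = np.array([1.1, 1.3, 1.5, 10.5])
    print(min_max_normalize(years))  # [0.  0.0212766  0.04255319 1. ]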

 

  • cost function

  The cost function evaluates how well a regression model fits: the lower the cost, the better the model fits the data set.

  In this example we use the common squared-error cost, i.e.

    $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2$
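  In code this squared-error cost looks roughly like the sketch below (the names are my own; it assumes the hypothesis h_θ(x) = θ0 + θ1·x defined earlier):

    import numpy as np

    def cost(theta0, theta1, x, y):
        """J(theta0, theta1) = 1/(2m) * sum((theta0 + theta1*x - y)^2)."""
        m = len(x)
        error = theta0 + theta1 * x - y   # h_theta(x) - y for every sample
        return np.sum(error ** 2) / (2 * m)

    # Toy check: a perfect fit has zero cost.
    x = np.array([0.0, 0.5, 1.0])
    print(cost(1.0, 2.0, x, 1.0 + 2.0 * x))  # 0.0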

 

  • Gradient and the derivation of the iteration formula

  The gradient is the direction in which a function changes fastest, i.e. the direction of the largest directional derivative; it is an application of partial derivatives (see calculus or advanced-mathematics material for this background; you can use the result directly even without fully understanding it, but if you want to study this in depth, a thorough grasp is better). From the definition of the gradient we get:

    $\frac{\partial J}{\partial \theta_1} = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x^{(i)}$

  This is the partial derivative with respect to θ1; the partial derivative with respect to θ0 can be derived in the same way and differs only slightly. If you are wondering why we take partial derivatives of the cost function at all, please read on.
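  For completeness, here is the short chain-rule derivation of that θ1 partial derivative from the squared-error cost defined above (the θ0 case is identical except that the trailing x^{(i)} factor becomes 1):

    $\frac{\partial J}{\partial \theta_1}
       = \frac{\partial}{\partial \theta_1}\,\frac{1}{2m}\sum_{i=1}^{m}\bigl(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\bigr)^2
       = \frac{1}{m}\sum_{i=1}^{m}\bigl(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\bigr)\,x^{(i)}
       = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x^{(i)}$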

 

  Once you understand the above, you know that for any given θ1 and θ0, the cost function lets us evaluate how good they are (the smaller the cost, the better the fit). At the same time, we know that for both θ1 and θ0 there must be some most accurate, best-fitting value; once we deviate from it, the farther we deviate, the worse the fit and the larger the value of the cost function.

  Assume θ0 is held constant at 0; then the graph of the cost function over θ1 should be shaped like a valley, as shown below:

      

  The θ1 at the lowest point fits best (minimum cost).

  Similarly, if θ0 is not held constant, the graph has two independent variables and is roughly bowl-shaped, as shown below:

  (Figure from Professor Ng's course)

  It is easy to see that the lowest point of the 'bowl', (θ0', θ1'), gives the best-fitting θ0 and θ1. So how do we find this lowest point?

  Gradient descent approaches it like this: first pick a random starting point, then iteratively step downhill along the gradient, getting closer and closer to the lowest point until the iteration converges there.

  The principle is simple enough, but how do we obtain the iteration formula?

  The answer is not hard: it follows directly from the concept of the gradient:

    $\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)$

    $\theta_1 := \theta_1 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x^{(i)}$

    (the partial-derivative part was covered in the prerequisite knowledge above)

  The geometric meaning is roughly this: from the point (θ0, θ1), take a step whose size is governed by the learning rate α along the gradient direction (the direction in which the function changes fastest), moving toward the lowest point.

  Pseudocode:

    ① initialize the point randomly

    ② iterate downhill along the gradient with a certain step size (the learning rate)

    ③ stop once sufficient precision is reached or the iteration converges
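  Here is a minimal Python sketch of those three steps, assuming the squared-error cost and update rule above (the function and variable names are my own; the stopping test compares the change in cost against the precision, which is one reasonable reading of the setup used later in this article):

    import numpy as np

    def gradient_descent(x, y, alpha=0.01, tol=1e-4, max_iter=10000):
        """Fit y ~ theta0 + theta1 * x with batch gradient descent.

        alpha: learning rate; tol: precision (stop once the cost change
        drops below it); max_iter: iteration cap.
        """
        m = len(x)
        theta0, theta1 = 0.0, 0.0                 # step 1: initialize the point
        costs, prev_cost = [], float('inf')
        for _ in range(int(max_iter)):            # step 2: walk down the gradient
            error = theta0 + theta1 * x - y       # h_theta(x) - y
            grad0 = np.sum(error) / m
            grad1 = np.sum(error * x) / m
            # simultaneous update of both parameters
            theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
            current_cost = np.sum(error ** 2) / (2 * m)
            costs.append(current_cost)
            if abs(prev_cost - current_cost) < tol:   # step 3: precision reached
                break
            prev_cost = current_cost
        return theta0, theta1, costs

  Calling it on normalized data with alpha = 0.01 mirrors the experiment described later in the article.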

  Some readers may not see why this converges; the reason involves a bit of calculus and goes roughly like this:

  As the point approaches the lowest point, the partial derivatives become smaller and smaller until they reach 0, at which point the iteration has converged. You can think of it this way: as we approach the lowest point, the magnitude of the tangent vector gradually decreases (the bowl-shaped surface becomes flatter and flatter); at the lowest point the tangent plane is parallel to the xOy plane and the tangent vector is 0, so no matter how many more iterations we run, the value of θ no longer changes.

 

  Note that the description of the principle above never mentioned the learning rate α. So what is it for?

  Those who have watched Professor Ng's course may recall that during the descent, if the step is too large, the iteration cannot converge to the lowest point; without α to control the step size, in many cases we may never find the lowest point at all.

  To solve this problem, gradient descent introduces α to control the step size of the descent and ensure convergence. At the same time, if α is too small, convergence becomes very slow.
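  A tiny experiment (toy data and α values of my own choosing, reusing the gradient_descent sketch above) makes both failure modes visible:

    import numpy as np

    # Toy data: y = 2x, so the ideal fit is theta0 = 0, theta1 = 2.
    x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
    y = 2 * x

    for alpha in (0.001, 0.5, 3.0):   # too small, reasonable, too large
        t0, t1, costs = gradient_descent(x, y, alpha=alpha, tol=1e-12, max_iter=200)
        print(f"alpha={alpha}: cost {costs[-1]:.3g} after {len(costs)} iterations")

    # A very large alpha makes the cost grow (divergence); a very small alpha
    # shrinks it, but far too slowly to converge within the iteration budget.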

  It must be stressed that θ0 and θ1 should be updated strictly simultaneously. As I understand it, this is because the gradient describes the direction of fastest change at the current point; if the updates were asynchronous, say we update θ1 first and then go on to update θ0, the update of θ0 would be based on the new θ1 and would therefore no longer satisfy the definition of the gradient.
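  To make the difference concrete, here is a tiny sketch (toy numbers of my own) contrasting the two update orders:

    alpha = 0.01
    theta0, theta1 = 1.0, 2.0
    grad0, grad1 = 0.5, -0.3   # pretend these came from the current point

    # Correct: simultaneous update -- both gradients were computed at the
    # same point (theta0, theta1) before either parameter moves.
    temp0 = theta0 - alpha * grad0
    temp1 = theta1 - alpha * grad1
    theta0, theta1 = temp0, temp1

    # Incorrect: updating theta0 first and then recomputing theta1's gradient
    # from the new theta0 no longer follows the gradient at the original
    # point, so the step direction is slightly wrong.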

  

  Incidentally, gradient descent in general only converges to a local minimum. In univariate linear regression, however, the local minimum is also the global minimum.

  

  That covers the theory of univariate linear regression; next we will try applying it to a data set:

  The data set (years of work experience vs. annual salary) looks like this:

    YearsExperience    Salary
    1.1                39343.00
    1.3                46205.00
    1.5                37731.00
    ...                ...

 

  The download link is given at the end of this article.

  First we normalize the data to speed up convergence. Then we simply set the initial θ to (0, 0), α to 0.01, the precision to 1e-4, and the maximum number of iterations to 1e4.
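  Putting the pieces together, a run of the experiment might look like the sketch below (the CSV file name and the choice to normalize both columns are my assumptions; min_max_normalize and gradient_descent are the helper sketches from earlier):

    import pandas as pd

    # File name is an assumption; the actual data set is linked at the end of the article.
    data = pd.read_csv('Salary_Data.csv')
    x = min_max_normalize(data['YearsExperience'].values)
    y = min_max_normalize(data['Salary'].values)

    # Settings from the text: theta starts at (0, 0), alpha = 0.01,
    # precision 1e-4, at most 1e4 iterations.
    theta0, theta1, costs = gradient_descent(x, y, alpha=0.01, tol=1e-4, max_iter=1e4)
    print(theta0, theta1, len(costs))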

  The fitting result looks like this:

  

  As you can see, the result is fairly good. Next is the plot of how the cost changes:

  

  You can see that after roughly 3000 iterations it has essentially converged; because the change at convergence is still larger than the precision threshold, the number of iterations ends up being the configured maximum.

  

  Summary:

    This material is quite easy to get started with: the key is to grasp the concept of the gradient and understand the iteration (gradually approaching a local minimum of the cost); after that nothing is difficult. Gradient descent is also used very widely in machine learning, so learning it is essential.

    I have pushed the data set and code to GitHub; if you need them, see: https://github.com/foolishkylin/workspace/tree/master/machine_learning/getting_started/gradient_descent/liner_regression_single

 
