Andrew Ng Machine Learning Introductory Notes 3: Linear Regression

3 Linear Regression

3.1 method of least squares

We try to find a straight line such that the sum of the Euclidean distances from all samples to the line is minimized.

3.2 cost function

The cost function measures how well the hypothesis fits the data; we usually want to minimize it.

The hypothesis for univariate linear regression is
\[ h_\theta(x)=\theta_0+\theta_1x\tag{3.1} \]
[Figure: cost function]

3.2.1 gradient descent

Iteratively update \(\theta\) until the cost function \(J\) is minimized.

[Figure: gradient descent summary]

[Figure: gradient descent, visual interpretation]

  • If \(\alpha\) is too small, gradient descent needs many steps to reach the global minimum
  • If \(\alpha\) is too large, gradient descent may overshoot the minimum, and may even fail to converge or diverge
  • With a properly chosen \(\alpha\), the steps shrink naturally as the algorithm approaches a local minimum: the derivative there is 0, so the slope keeps getting smaller and the gradient steps become smaller and smaller; there is no need to worry that a fixed \(\alpha\) will prevent convergence
  • If the parameters have already reached a local optimum, the next update leaves them unchanged (the gradient is zero)
(A) Batch gradient descent

Each gradient descent step uses the entire training set.
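A minimal sketch of batch gradient descent for the univariate hypothesis in (3.1), not the course's starter code; X, y, alpha and num_iters are assumed to be defined, with X an m-by-2 matrix whose first column is all ones:

    m = length(y);
    theta = zeros(2, 1);
    for iter = 1:num_iters
        h = X * theta;                                 % predictions for all m samples
        theta = theta - (alpha / m) * (X' * (h - y));  % one step uses the whole training set
    end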

3.2.2 cost function - squared error function

\[ J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)^2\tag{3.2} \]

[Figure: squared error function]
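A sketch of equation (3.2) as a MATLAB function; the name computeCost and the argument names are illustrative, not prescribed by the course:

    function J = computeCost(X, y, theta)
    % Squared error cost: X is m-by-(n+1), y is m-by-1, theta is (n+1)-by-1.
        m = length(y);
        J = sum((X * theta - y) .^ 2) / (2 * m);
    end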

3.3 Multivariate linear regression

Multivariate means multiple input variables \(x\) and multiple parameters \(\theta\), where \(x_0=1\)
\[ h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\cdots+\theta_nx_n=\theta^Tx\tag{3.3} \]

\[ x=\left[\begin{matrix}x_0\\x_1\\x_2\\\vdots\\x_n\end{matrix}\right] \quad \theta=\left[\begin{matrix}\theta_0\\\theta_1\\\theta_2\\\vdots\\\theta_n\end{matrix}\right] \]
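Stacking the training examples as rows of a design matrix \(X\) (with the \(x_0=1\) column first) turns equation (3.3) for all samples into one matrix product; a small sketch with assumed names:

    % X: m-by-(n+1) design matrix, theta: (n+1)-by-1 parameter vector
    h = X * theta;    % m-by-1 vector of predictions, h(i) = theta' * x^(i)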

3.3.1 Multivariate gradient descent

[Figure: multivariate gradient descent]

Tricks to speed up gradient descent

  • Feature scaling: bring every feature into roughly the range [-3, 3]; features whose ranges are much larger or much smaller should be rescaled (see the sketch after this list). [Figure: make sure the feature value ranges are normalized]
  • Mean normalization: subtract the feature's mean from every value of that feature
  • \(x_1\leftarrow\frac{x_1-\mu_1}{S_1}\), where \(S_1\) is the range of the feature (max minus min)
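A sketch of the feature-scaling and mean-normalization tricks above; X holds one feature per column, and the names mu, S and X_norm are illustrative:

    mu = mean(X);                                          % mean of each feature
    S  = max(X) - min(X);                                  % range of each feature (std(X) is another common choice)
    X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu), S);   % (x - mu) ./ S applied to every column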

How to check that the cost function is decreasing correctly

  • Tune the learning rate \(\alpha\): try values such as 0.003, 0.03, 0.3 and pick the largest one that still converges
  • Plot \(J(\theta)\) against the number of iterations and check that it keeps decreasing as the iterations go on (see the sketch below)
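A sketch of that convergence check, assuming the cost after every iteration has been stored in a vector J_history inside the descent loop:

    plot(1:numel(J_history), J_history);
    xlabel('number of iterations');
    ylabel('J(\theta)');     % the curve should keep decreasing if alpha is chosen well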

Based on experience, choose different features to use in the linear regression

3.3.2 Normal equation

A suitable parameter vector \(\theta\) that brings the cost function to its minimum can be obtained directly, without many iterations of computation
\[ \theta=(X^TX)^{-1}X^Ty\tag{3.4} \]
\(X\) is the feature (design) matrix and \(y\) is the vector of observed target values. [Figure: normal equation]
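A sketch of equation (3.4) in MATLAB; pinv is an implementation choice (not part of the formula) that still works when \(X^TX\) is non-invertible, e.g. because of redundant features:

    theta = pinv(X' * X) * X' * y;      % normal equation
    % equivalent, and usually faster and more stable:
    theta = (X' * X) \ (X' * y);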

3.3.3 Advantages and disadvantages of gradient descent and the normal equation

|      | Gradient descent | Normal equation |
| ---- | ---------------- | --------------- |
| Cons | 1. a learning rate \(\alpha\) must be chosen; 2. many iterations are needed | 1. computing \((X^TX)^{-1}\) is expensive |
| Pros | running time is only weakly affected by the number of features | 1. no learning rate to choose; 2. no iterations needed |

  • With fewer than roughly 10,000 features the normal equation is a good choice; with more, use gradient descent

3.4 Programming tips

3.4.1 MATLAB

On code style

  1. Put a blank line after each %% code section; it makes the script easier to read
  • When implementing gradient descent, remember to transpose the summed matrix when computing the partial derivatives of the cost function, so that the parameter vector and its update have matching dimensions (a vectorized sketch follows the equation below)

\[ \theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \]
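A vectorized sketch of this update with assumed names (theta is (n+1)-by-1, X is m-by-(n+1), y is m-by-1); the transpose on X is what lines each parameter up with its own gradient component:

    theta = theta - (alpha / m) * (X' * (X * theta - y));   % all theta_j updated simultaneously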

  • In MATLAB, inv(A)*B is slower than A\B; the backslash operator solves the linear system Ax=B, so
    inv(X'*X)*X' == (X'*X)\X'    % same result, but the backslash form is faster

  • For element-wise operations applied column by column, bsxfun(fun,A,B) is a faster way to do it, as in the sketch below
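A small illustration with an assumed m-by-n matrix A: subtract each column's mean from that column without an explicit loop:

    A_centered = bsxfun(@minus, A, mean(A));   % mean(A) is 1-by-n, expanded across the rows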

  • With multiple variables, the vectorized computation is faster (a MATLAB one-liner follows the formula)
    \[ J(\theta)=\frac{1}{2m}(X\theta-\vec{y})^T(X\theta-\vec{y}) \]
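In MATLAB this is a one-liner; a sketch assuming X, y, theta and m are already defined:

    J = (X * theta - y)' * (X * theta - y) / (2 * m);   % same value as equation (3.2), no loop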

  • For multivariate linear regression, a new data point must first be normalized in the same way as the training features, and then the bias entry \(x_0=1\) must be added to the new data row (see the sketch below)
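A sketch of that prediction step, assuming mu and S (the training-set means and ranges) and theta were saved from training, and x_new is a 1-by-n row of raw feature values:

    x = [1, (x_new - mu) ./ S];   % normalize with the TRAINING statistics, then add x_0 = 1
    prediction = x * theta;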

  • Strictly speaking you are not required to compute \(J(\theta)\) itself, only its partial derivatives; but to monitor \(J(\theta)\) and make sure gradient descent is working properly, record the value of the cost function at every update

  • options=optimset('GradObj','on','MaxIter',100);
    initialTheta=zeros(2,1);
    % the cost function must return both the cost and the gradient (the partial derivative of the cost w.r.t. each parameter):
    % function [jVal,gradient]=costFunction(theta)
    [optTheta,functionVal,exitFlag]=fminunc(@costFunction,initialTheta,options); % @ creates a function handle; fminunc minimizes the cost function
    % fminunc requires the parameter vector to be at least two-dimensional
    % use the configured advanced optimizer to compute the parameter values that minimize the cost function
  • % backpropagation: unroll the parameter matrices into one vector, then reshape them back
    thetaVec=[Theta1(:) ; Theta2(:) ; Theta3(:)];
    DVec=[D1(:) ; D2(:) ; D3(:)];
    Theta1=reshape(thetaVec(1:110),10,11);
    Theta2=reshape(thetaVec(111:220),10,11);
    Theta3=reshape(thetaVec(221:231),1,11);

    [Figure: backpropagation algorithm in MATLAB]

  • In MATLAB, sum(A.^2) and A*A' (for a row vector A) can give slightly different values because of floating-point rounding

  • When programming with matrices, write functions so that they handle inputs of different dimensions

  • SVM packages usually add the bias term \(x_0=1\) automatically, so there is no need to add it yourself

  • Remember to call svmTrain on the training set, not on the cross-validation set
