Andrew Ng Machine Learning (3): Multivariate Linear Regression and the Normal Equation

I. Multivariable Linear Regression

1. Notation

"Multi-variable" means that each sample has several features, which together form a feature vector. For example, to describe a house we might record its size, the number of rooms, the number of floors, and so on. For convenience we collect these features into a vector, as in the table below.

(Table of training samples: one row per house, with columns such as the size, the number of rooms, and the number of floors.)

Each row of the table is one sample's feature vector. Because there are now several features, we need some extra notation.

n: the number of features. For example, house size, number of rooms, and number of floors give n = 3.

x^(i): the feature vector of the i-th training sample.

x_j^(i): the value of the j-th feature in the i-th sample.

2. The hypothesis function

With more features the hypothesis function becomes more complicated, but we still assume it is linear: hθ(x) = θ0 + θ1·x1 + θ2·x2 + … + θn·xn. To write this compactly as a product of two vectors, we define x0 = 1. Treating θ and x as column vectors, the hypothesis can then be written as hθ(x) = θᵀx. Because of the added x0 = 1, both x and θ are (n+1)-dimensional vectors.

x = [x0, x1, x2, …, xn]ᵀ,   θ = [θ0, θ1, θ2, …, θn]ᵀ

hθ(x) = θᵀx = θ0·x0 + θ1·x1 + … + θn·xn   (with x0 = 1)
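
As a small illustration (the numbers and variable names here are made up, not from the lecture), computing hθ(x) = θᵀx in Octave:

    x = [2104; 5; 2];             % one sample: size, number of rooms, number of floors
    theta = [340; 0.1; 20; 5];    % some (n+1)-dimensional parameter vector
    x = [1; x];                   % prepend x0 = 1 so x is also (n+1)-dimensional
    h = theta' * x;               % h_theta(x) = theta^T x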

3. The loss function and gradient descent

As in univariate linear regression, we measure the gap between the predicted and actual values with the squared (Euclidean) distance, which gives the loss function J(θ):

J(θ) = (1 / 2m) · Σ_{i=1..m} ( hθ(x^(i)) − y^(i) )²

Gradient descent repeats, simultaneously for every j = 0, 1, …, n:

θj := θj − α · (1 / m) · Σ_{i=1..m} ( hθ(x^(i)) − y^(i) ) · x_j^(i)

As before, we use gradient descent to find the minimum. Note that there is only a single update formula for all j: because of the added x0 = 1, the j = 0 case no longer needs to be written out separately.
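
As a sketch of how the update can be vectorized in Octave (one possible implementation under the conventions above; the function name and design-matrix layout are my own choices, with X carrying a leading column of ones):

    function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
      % X: m x (n+1) design matrix whose first column is all ones
      % y: m x 1 vector of actual values, theta: (n+1) x 1 parameter vector
      m = length(y);
      J_history = zeros(num_iters, 1);
      for iter = 1:num_iters
        theta = theta - (alpha / m) * (X' * (X * theta - y));          % update all theta_j simultaneously
        J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);   % record J(theta) for diagnostics
      end
    end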

II. Some Techniques

1. Feature scaling (Feature Scaling)

  • Problem: different features can have very different ranges. For example, x1 may range from 0 to 2000 while x2 ranges from 1 to 5. The contours of J(θ) then become very elongated (flat), and during gradient descent a tiny change in θ1 causes a large change in J(θ), so the iterates oscillate back and forth.
  • Solution: feature scaling.
  • Specifically: apart from the added x0, divide every feature by its maximum absolute value, so that each feature lies in the range -1 to 1.

Feature scaling not only removes the oscillation problem, it also speeds up the convergence of gradient descent. Note that the features do not have to lie strictly within -1 to 1; it is enough that the different features have ranges of similar size. For example, if x1 ranges from 0 to 500 we would scale it down to 0 to 1, while x2 ranging from -3 to 3 is fine as it is and needs no scaling. There is only one guiding principle: make the contours of J(θ) as close to circles as possible, never too flat.
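
A minimal sketch of scaling by the maximum absolute value (assuming X holds the raw features, one column per feature, without the x0 column):

    max_abs = max(abs(X));      % 1 x n row vector: maximum absolute value of each feature
    X_scaled = X ./ max_abs;    % every feature now lies roughly in -1 to 1 (Octave broadcasting)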

2. Mean normalization (Mean Normalization)

Another option is mean normalization. Specifically: for each feature, compute its mean and its standard deviation (or the difference between its maximum and minimum), then subtract that mean from the feature's value in every sample and divide by the corresponding standard deviation (or max-min range).
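
A corresponding sketch in Octave (again assuming X holds the raw features without the x0 column):

    mu = mean(X);                  % per-feature means
    sigma = std(X);                % per-feature standard deviations (max(X) - min(X) also works)
    X_norm = (X - mu) ./ sigma;    % subtract the mean, then divide by the spread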

III. Debugging

1. Checking that the algorithm is working correctly

We can plot the loss function J(θ) against the number of iterations. Since the hypothesis fits the data better and better, J(θ) should decrease after every iteration. We can judge convergence by inspection: if J(θ) barely changes after a certain number of iterations (say 400), the algorithm can be considered to have converged. An automatic test is also possible: if J(θ) differs from its value at the previous iteration by no more than 10⁻³, we can also declare convergence. This threshold is hard to pick well, however, so in practice visual inspection is still the main approach.
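
Both checks take only a few lines, assuming J_history stores J(θ) after every iteration (for instance as returned by the gradientDescent sketch above):

    plot(1:numel(J_history), J_history);   % J(theta) should decrease monotonically
    xlabel('number of iterations');
    ylabel('J(theta)');
    % Automatic test: converged if the last change in J is at most 1e-3
    converged = abs(J_history(end) - J_history(end - 1)) <= 1e-3;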

2. The learning rate α

If J(θ) behaves as in case ① (it keeps increasing), the cause is that α is too large: θ overshoots the minimum, so J(θ) grows with every iteration. If it behaves as in case ② (it oscillates up and down), the cause is again that α is too large. In both cases the remedy is to try a smaller α until a suitable value is found.

In general, if α is too large the situations above can occur, and J(θ) converges very slowly or not at all; if α is too small, the algorithm also converges slowly and needs many more iterations to reach the minimum.

To choose α, try a few pre-selected values, plot the J(θ) curve for each, and pick the best one by inspection, for example values such as 0.01, 0.03, 0.1, and 0.3.
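
A sketch of such a sweep, reusing the hypothetical gradientDescent function from the earlier sketch:

    alphas = [0.01, 0.03, 0.1, 0.3];
    num_iters = 400;
    hold on;
    for k = 1:numel(alphas)
      theta_init = zeros(size(X, 2), 1);    % restart from theta = 0 for each candidate
      [theta_k, J_history] = gradientDescent(X, y, theta_init, alphas(k), num_iters);
      plot(1:num_iters, J_history);         % one J(theta) curve per learning rate
    end
    legend('alpha = 0.01', 'alpha = 0.03', 'alpha = 0.1', 'alpha = 0.3');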

IV. Polynomial Regression

If the pattern in the data is not a straight line but a curve, a polynomial is needed to fit it. Suppose the curve to fit is hθ(x) = θ0 + θ1·x + θ2·x² + θ3·x³. We can turn this into a multivariate linear regression problem by setting x1 = x, x2 = x², x3 = x³. The hypothesis then becomes hθ(x) = θ0 + θ1·x1 + θ2·x2 + θ3·x3, which is exactly the multivariate linear regression problem we are already familiar with.
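
A minimal sketch of the substitution in Octave (x here is the column vector of the original single feature; the names are illustrative):

    X_poly = [ones(size(x)), x, x.^2, x.^3];   % columns: x0 = 1, x1 = x, x2 = x^2, x3 = x^3
    % Fit theta on X_poly exactly as in ordinary multivariate linear regression
    % (feature scaling helps here, since x, x^2 and x^3 have very different ranges)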

V. The Normal Equation

1. Deriving the normal equation

Let us look at the problem from a different angle. So far we have used the loss function J(θ), measuring the gap between predicted and actual values with the squared (Euclidean) distance, and obtained θ step by step with gradient descent. Our goal is simply to make the predictions as close to the actual values as possible. So for a given sample we can require θᵀx^(i) = y^(i); solving this equation gives the θ for which hθ(x) best matches the actual value y^(i), which at the same time minimizes the loss J(θ) (since the prediction fits y^(i)). Following this idea, we write one such equation for every sample and obtain a system of equations, so the problem becomes solving this system for θ. We use the following notation: X denotes all the samples and y the actual values, with X written as a matrix and y as a vector.

This gives the system Xθ = y. Multiplying both sides on the left by Xᵀ (we multiply by the transpose mainly to obtain the square matrix XᵀX, which makes taking an inverse possible), and then multiplying on the left by (XᵀX)⁻¹, we obtain θ = (XᵀX)⁻¹ Xᵀ y.

In Octave (MATLAB works too), the pinv function computes the inverse and X' gives the transpose, so θ can be computed with a single line of code.
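
Based on that description, the line looks like this (a reconstruction, since the original figure is not preserved):

    theta = pinv(X' * X) * X' * y;   % normal equation: theta = (X^T X)^(-1) X^T y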

2. Gradient descent vs. the normal equation

Solving with the normal equation is very convenient, so it may look as if gradient descent is no longer needed to find θ. In fact each method has its own advantages and disadvantages; the table below compares them.

| Gradient descent | Normal equation |
| --- | --- |
| Needs to choose a learning rate α | No need to choose α |
| Needs many iterations to solve for θ | No iterations; θ is computed directly |
| Solves quickly even when the number of features n is very large | Must compute (XᵀX)⁻¹, which is O(n³), so it becomes slow when n is large |

So when the number of features per sample is very large (for example, more than 10,000), gradient descent should be used to find θ.

3. What to do when XᵀX is not invertible

When XᵀX is not invertible, it is called a singular (Singular) or degenerate (Degenerate) matrix.

The fix is simple: in Octave the pinv command always returns an inverse (a pseudo-inverse), whereas the inv command fails when the matrix is not invertible.
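
A small illustration with a hypothetical singular matrix (not from the course):

    A = [1 2; 2 4];      % rank 1, so A is singular
    % inv(A) would warn that A is singular and give unusable (Inf) entries
    b = [1; 2];
    sol = pinv(A) * b;   % the pseudo-inverse always exists, so this never fails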

The main reasons why XᵀX can be non-invertible are:

  • Linearly dependent features. For example, if x1 is the house area in square feet and x2 is the same area in square meters, then x1 and x2 are linearly dependent. The fix is to delete one of the linearly dependent features.
  • Too many features, so that the number of samples m is smaller than the number of features n. From a linear-algebra point of view, the system then has more unknowns than equations, so Xθ = y may have no solution or infinitely many solutions, but it can never have a unique one. The fix is to delete some features by hand so that m > n, or to use regularization (after regularization, XᵀX is guaranteed to be invertible).


Origin www.cnblogs.com/yayuanzi8/p/11031007.html