[Machine Learning] Coursera Machine Learning - Linear Regression

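I have recently been working through the Machine Learning course on Coursera in a systematic way, and I have just finished the first two weeks. The material is simple, but I still got a lot out of it, and Andrew explains some of the details really well. Going forward, for each programming assignment I plan to write down the parts of the corresponding lectures that I found worth noting on this pass.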

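Also, some of my notes from the lectures and experiments were taken in English, so this blog is written in a mix of Chinese and English. Apologies to any readers who find that inconvenient.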

Some basic concepts

What is machine learning?

Arthur Samuel: The field of study that gives computers the ability to learn without being explicitly programmed.
Tom Mitchell: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Unsupervised Learning
Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don’t necessarily know the effect of the variables.

Clustering: news search, social network analysis, market segmentation, astronomical data analysis
Non-clustering: cocktail party problem

Notes on using gradient-based optimization

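When optimizing a linear regression problem (an L2 problem with the MSE loss) with gradient descent (abbreviated as SGD in this post), two points need special attention: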
1. Feature scaling and mean normalization
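Make sure the features are on a similar scale, so that the gradient behaves more consistently across dimensions. Getting each feature roughly into the range [-1, 1] is good enough.
(Figure: feature scaling, from Andrew Ng's Machine Learning course.)
The figure above illustrates feature scaling. Combining it with mean normalization gives: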
$x_i := \frac{x_i - u_i} {s_i}$
where $u_i$ is the average of all the values of feature $i$, and $s_i$ is either the range of its values (max - min) or its standard deviation.
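The formula above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the code from the assignment: the function name `feature_normalize` and the sample values are made up here, and the standard deviation is used for $s_i$ (the range would work the same way).

```python
import numpy as np

def feature_normalize(X):
    """Apply x_i := (x_i - u_i) / s_i to every column of X.

    u_i is the column mean and s_i the column standard deviation.
    Both are returned so that new samples can be transformed identically.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return (X - mu) / sigma, mu, sigma

# Made-up sample: house size in sq-ft and number of bedrooms.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
X_norm, mu, sigma = feature_normalize(X)
print(X_norm)
```

One point that is easy to miss: at prediction time the new sample has to be transformed with the same `mu` and `sigma` that were computed on the training set.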

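The basic operation behind the now-popular batch normalization is the same. This shows how important it is to master the basic concepts and methods of machine learning: many advanced algorithms grow out of this foundational knowledge!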

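2. Learning rate
Choosing a suitable learning rate is very important. If it is too large, the loss is likely to diverge or oscillate; if it is too small, convergence becomes very slow. You can borrow Andrew's approach and try values spaced roughly a factor of 3 apart (for example 0.001, 0.003, 0.01, 0.03, ...).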
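To make the effect of the learning rate concrete, here is a small sketch that sweeps values spaced about 3x apart on synthetic data. Everything in it is an assumption for illustration: the data is randomly generated, `gradient_descent` is a plain batch implementation written for this note, and the 50-iteration budget is arbitrary.

```python
import numpy as np

def gradient_descent(X, y, alpha, num_iters):
    """Batch gradient descent for linear regression with the MSE cost.

    X is expected to already contain a leading column of ones.
    Returns the final parameters and the cost after every iteration.
    """
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(num_iters):
        error = X @ theta - y
        theta = theta - (alpha / m) * (X.T @ error)
        history.append(float(np.sum((X @ theta - y) ** 2) / (2 * m)))
    return theta, history

# Synthetic one-feature problem (already on a unit scale).
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
X = np.hstack([np.ones((100, 1)), x])
y = 4.0 + 3.0 * x[:, 0] + 0.1 * rng.normal(size=100)

final_cost = {}
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0]:
    _, history = gradient_descent(X, y, alpha, num_iters=50)
    final_cost[alpha] = history[-1]
    print(f"alpha={alpha:<6} cost after 50 iterations: {history[-1]:.4g}")
```

On this toy problem the smallest rates barely reduce the cost within the budget, the mid-range rates converge, and the largest one blows up, which matches the behavior described above.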

Normal equation and closed-form solution

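For this problem, besides gradient-based methods, the closed-form solution can be obtained directly from the normal equation, $\theta = (X^TX)^{-1}X^Ty$, which can be derived by solving the least-squares problem (MSE loss). In this case feature scaling is no longer necessary.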
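For reference, the derivation mentioned above can be written out as follows. The notation is assumed rather than taken from the post: $m$ is the number of training examples, $X$ is the design matrix with one example per row (including the column of ones), and the cost uses the $\frac{1}{2m}$ scaling convention, which does not change the minimizer.

$$
\begin{aligned}
J(\theta) &= \frac{1}{2m}(X\theta - y)^T(X\theta - y) \\
\nabla_\theta J(\theta) &= \frac{1}{m}X^T(X\theta - y) = 0 \\
X^TX\theta &= X^Ty \\
\theta &= (X^TX)^{-1}X^Ty
\end{aligned}
$$

The last step assumes that $X^TX$ is invertible.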

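Comparison with gradient-based methods: when $n$ (the number of features) is very large, the closed-form solution becomes very expensive because of the matrix inversion, so a gradient method is the better choice; when $n$ is fairly small, solving the closed form directly is more convenient.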
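A minimal NumPy sketch of the closed-form route is shown below. The data is synthetic and the helper name `normal_equation` is made up for this note. It also uses `np.linalg.solve` on the system $X^TX\theta = X^Ty$ instead of forming the inverse explicitly; this is a common numerical choice, not something the post prescribes.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta.

    Equivalent to theta = (X^T X)^{-1} X^T y when X^T X is invertible,
    but avoids computing the inverse explicitly.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data with deliberately unscaled features.
rng = np.random.default_rng(0)
m, n = 200, 3
features = rng.normal(loc=1000.0, scale=300.0, size=(m, n))
X = np.hstack([np.ones((m, 1)), features])   # prepend the intercept column
true_theta = np.array([5.0, 0.2, -0.1, 0.05])
y = X @ true_theta                            # noise-free targets

theta = normal_equation(X, y)
print(theta)
```

Note that the features are not scaled here and the parameters are still recovered, whereas gradient descent on the same raw data would need a very small learning rate.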
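If $X^TX$ is not invertible, things get a bit more troublesome. Andrew suggests the following: first check whether there are redundant features (features that are linearly dependent), then see whether some less important features can be deleted, or use regularization.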
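The sketch below reproduces the redundant-feature case on made-up data (the same size expressed in two units) and tries three ways out. Dropping the redundant column and adding L2 regularization follow the suggestions above; the pseudo-inverse via `np.linalg.pinv` is an extra option added here for comparison. The variable names and the value `lam = 1.0` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
size_sqft = rng.uniform(500.0, 3000.0, size=m)
size_sqm = size_sqft * 0.092903           # same information in other units
X = np.column_stack([np.ones(m), size_sqft, size_sqm])
y = 50.0 + 0.1 * size_sqft                # made-up, noise-free target

# The columns are linearly dependent, so X^T X is singular.
rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)

# Option 1: remove the redundant feature and solve the normal equation.
X_reduced = X[:, :2]
theta_reduced = np.linalg.solve(X_reduced.T @ X_reduced, X_reduced.T @ y)

# Option 2 (extra): use the pseudo-inverse, which picks the
# minimum-norm least-squares solution.
theta_pinv = np.linalg.pinv(X) @ y

# Option 3: L2 regularization makes the system invertible again.
# The intercept term is left unpenalized.
lam = 1.0
penalty = lam * np.diag([0.0, 1.0, 1.0])
theta_ridge = np.linalg.solve(X.T @ X + penalty, X.T @ y)

print(theta_reduced, theta_pinv, theta_ridge)
```

The three parameter vectors differ, but on this toy data they produce essentially the same predictions, because the redundant column adds no new information.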

Experiments

The completed code for the ex1 assignment is on this GitHub page for anyone who needs it.

Learning rate
The learning rate influences the convergence speed of the optimization. Figure 1 illustrates the convergence curves for different learning rates on the linear regression problem.
Feature scaling and mean normalization
The figure above was produced with normalized features. Here I remove the normalization and run the gradient-based optimization again.

The following experiment (Figure 2) is run without feature normalization. In order to converge, the learning rate has to be very small because the loss surface is extremely steep.

Although Figures 1 and 2 look very similar, the final losses after 500 epochs differ: 2.0433e9 for Figure 1 and 2.3978e9 for Figure 2.

Finally, we compare these two approaches with the closed-form method on one test sample (a 1650 sq-ft, 3-bedroom house). The results are as follows:

Method	Prediction
SGD w/o feature normalization	$272882.552145
SGD w feature normalization	$293081.464335
Normal equation (analytic method)	$293081.464338

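As we can see, gradient-based optimization with feature normalization gives almost the same result as the closed-form solution, while the result without feature normalization is noticeably off! Without feature normalization the learning rate has to be set very small, so the optimization may end up stalling near the optimum and struggle to actually reach it.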
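The same qualitative behavior can be reproduced on synthetic data. The sketch below is not the ex1 code and does not use the ex1 dataset: the house data is randomly generated, and the learning rates (`0.1` and `1e-7`) and the 500-iteration budget are assumptions chosen for this illustration, so the numbers will not match the table above.

```python
import numpy as np

def gradient_descent(X, y, alpha, num_iters):
    """Plain batch gradient descent for the MSE cost."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        theta -= (alpha / m) * (X.T @ (X @ theta - y))
    return theta

# Made-up housing data: size in sq-ft, number of bedrooms, price.
rng = np.random.default_rng(0)
m = 47
size = rng.uniform(800.0, 4500.0, size=m)
rooms = rng.integers(1, 6, size=m).astype(float)
price = 90000.0 + 140.0 * size - 9000.0 * rooms + rng.normal(0.0, 20000.0, size=m)
raw = np.column_stack([size, rooms])
ones = np.ones((m, 1))
query = np.array([1650.0, 3.0])            # the house to price

# Closed-form solution on the raw features.
X_raw = np.hstack([ones, raw])
theta_ne = np.linalg.solve(X_raw.T @ X_raw, X_raw.T @ price)
pred_ne = np.array([1.0, *query]) @ theta_ne

# Gradient descent with feature normalization.
mu, sigma = raw.mean(axis=0), raw.std(axis=0)
X_norm = np.hstack([ones, (raw - mu) / sigma])
theta_gd = gradient_descent(X_norm, price, alpha=0.1, num_iters=500)
pred_gd = np.array([1.0, *((query - mu) / sigma)]) @ theta_gd

# Gradient descent without normalization needs a tiny learning rate.
theta_raw = gradient_descent(X_raw, price, alpha=1e-7, num_iters=500)
pred_raw = np.array([1.0, *query]) @ theta_raw

print(f"normal equation:          {pred_ne:.2f}")
print(f"GD with normalization:    {pred_gd:.2f}")
print(f"GD without normalization: {pred_raw:.2f}")
```

With the tiny learning rate, the intercept and the bedroom coefficient barely move in 500 iterations, which is why the un-normalized run ends up away from the closed-form answer.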

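One last remark: least squares is a very classic and fundamental problem in machine learning, and it touches on a great many topics: loss functions and optimization methods (gradient-based and numerical); the geometric meaning of least squares seen through matrix analysis (pseudo-inverse, range, spanned subspace, orthogonal projection matrix, overdetermined systems); the connection to maximum likelihood (the prior distribution of the error); different forms of regularization (L2 constraint for ridge regression, L1 constraint for Lasso regression); and so on. I will write a dedicated summary when I get the chance and leave it at that for now.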