Linear Regression with Multiple Variables
Multiple features
Notation
- \(n\) = number of features.
- \(x^{(i)}\) = input (features) of \(i^{th}\) training example.
- \(x_j^{(i)}\) = value of feature \(j\) in \(i^{th}\) training example.
Hypothesis
- Previously: \(h_\theta(x) = \theta_0 + \theta_1x\)
- Four features: \(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + \theta_4x_4\)
- Multiple features: \(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n\)
For convenience of notation, define \(x_0 = 1\) (so \(x_0^{(i)} = 1\) for every example).
Then \(h_\theta(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n = \theta^Tx\).
We call this multivariate linear regression.
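With \(x_0 = 1\) prepended, the hypothesis is a single dot product. A minimal NumPy sketch (the parameter and feature values below are made up for illustration):

```python
import numpy as np

# Hypothetical example: n = 4 features, with x_0 = 1 prepended.
theta = np.array([80.0, 0.1, 0.01, 3.0, -2.0])  # (n+1,) parameter vector
x = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])     # (n+1,) feature vector, x[0] = 1

h = theta @ x  # h_theta(x) = theta^T x
```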
Gradient descent for multiple variables
Hypothesis
\(h_\theta(x) = \theta^Tx = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n\)
Parameters
\(\theta_0, \theta_1, \ldots, \theta_n\) --> an \((n + 1)\)-dimensional vector \(\theta\)
Cost function
\(J(\theta) = J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m}\sum^m_{i = 1}(h_\theta(x^{(i)}) - y^{(i)})^2\) --> a scalar function of the \((n + 1)\)-dimensional vector \(\theta\)
Gradient descent
Repeat {
\(\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\ldots,\theta_n)\)
} (simultaneously update for every \(j = 0, \ldots, n\))
Previously (n = 1):
Repeat {
\(\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})\)
\(\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}\)
}
New algorithm (\(n \geq 1\)):
Repeat {
\(\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\)
} (simultaneously update \(\theta_j\) for \(j = 0, \ldots, n\))
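The update rule above vectorizes cleanly over all \(\theta_j\) at once. A minimal sketch (the toy data, \(\alpha\), and iteration count are illustrative choices):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix whose first column is all ones (x_0 = 1).
    y: (m,) targets. Returns the learned (n+1,) parameter vector.
    """
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        error = X @ theta - y               # h_theta(x^(i)) - y^(i), all i at once
        theta -= alpha * (X.T @ error) / m  # simultaneous update of every theta_j
    return theta

# Tiny noiseless check: the data follow y = 1 + 2*x1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = gradient_descent(X, y, alpha=0.1, iters=5000)
```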
Gradient descent in practice I: Feature Scaling
Feature Scaling
Idea: Make sure features are on a similar scale.
E.g. \(x_1\) = size (0–2000 \(\text{feet}^2\)), \(x_2\) = number of bedrooms (1–5)
---> \(x_1 = \frac{\text{size}(\text{feet}^2)}{2000}\), \(x_2 = \frac{\text{number of bedrooms}}{5}\)
Get every feature into approximately a \(-1 \leq x_i \leq 1\) range; ranges that are far larger or far smaller are not acceptable:
\(-100 \leq x_i \leq 100\) or \(-0.0001 \leq x_i \leq 0.0001\) (×)
Mean normalization
Replace \(x_i\) with \(x_i - \mu_i\) to make features have approximately zero mean (do not apply to \(x_0 = 1\)).
E.g. \(x_1 = \frac{\text{size} - 1000}{2000}\), \(x_2 = \frac{\text{bedrooms} - 2}{5}\) --> \(-0.5 \leq x_1 \leq 0.5\), \(-0.5 \leq x_2 \leq 0.5\).
More generally: \(x_1 := \frac{x_1 - \mu_1}{s_1}\),
where \(\mu_1\) is the average value of \(x_1\) in the training set, and \(s_1\) is the range of the feature (maximum minus minimum) or, alternatively, its standard deviation.
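Mean normalization can be sketched in a few lines (the housing numbers below are hypothetical):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature (column) via (x - mean) / range.

    Apply only to real features; the x_0 = 1 column is added afterwards.
    Returns the scaled matrix plus (mu, s) so new inputs can be scaled the same way.
    """
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)  # range; could also use X.std(axis=0)
    return (X - mu) / s, mu, s

# Hypothetical housing features: size (feet^2) and number of bedrooms.
X = np.array([[2104.0, 3.0], [1416.0, 2.0], [1534.0, 3.0], [852.0, 2.0]])
X_norm, mu, s = mean_normalize(X)
```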
Gradient descent in practice II: Learning rate
Making sure gradient descent is working correctly
\(J(\theta)\) should decrease after every iteration. Plot \(J(\theta)\) against the number of iterations; once the curve has flattened out, gradient descent has converged.
Note: the number of iterations gradient descent needs can vary widely from problem to problem.
Example automatic convergence test:
Declare convergence if \(J(\theta)\) decreases by less than some small threshold \(\varepsilon\) (e.g. \(10^{-3}\)) in one iteration.
Note: choosing an appropriate \(\varepsilon\) is usually difficult, so inspecting the plot of \(J(\theta)\) is generally preferred.
Choose learning rate \(\alpha\)
Summary
If \(\alpha\) is too small: slow convergence.
If \(\alpha\) is too large: \(J(\theta)\) may not decrease on every iteration; may not converge.
Choose \(\alpha\)
try …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
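The automatic convergence test and the \(\alpha\) grid can be combined into a simple loop. A sketch (the toy data, \(\varepsilon = 10^{-3}\), and iteration cap are illustrative):

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = (1/2m) * sum((X theta - y)^2)."""
    m = len(y)
    r = X @ theta - y
    return (r @ r) / (2 * m)

def run_until_converged(X, y, alpha, eps=1e-3, max_iters=10_000):
    """Gradient descent until J decreases by less than eps in one iteration."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    prev = cost(X, y, theta)
    for i in range(max_iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / m
        cur = cost(X, y, theta)
        if prev - cur < eps:  # automatic convergence test
            return theta, i + 1
        prev = cur
    return theta, max_iters

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
# Try values from the grid; a larger (but still stable) alpha needs fewer steps.
_, iters_small = run_until_converged(X, y, alpha=0.01)
_, iters_large = run_until_converged(X, y, alpha=0.1)
```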
Features and polynomial regression
Housing prices prediction
\(h_\theta(x) = \theta_0 + \theta_1 \times \text{frontage} + \theta_2 \times \text{depth}\)
--> Land area: \(x = \text{frontage} \times \text{depth}\) --> \(h_\theta(x) = \theta_0 + \theta_1x\)
Sometimes, viewing the problem from a different angle and defining a new feature, rather than using the original features directly, yields a better model.
Polynomial regression
For example, when a straight line does not fit the data well, choose a quadratic or cubic model:
\(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 = \theta_0 + \theta_1(size) + \theta_2(size)^2 + \theta_3(size)^3\)
where \(x_1 = (size), x_2 = (size)^2, x_3 = (size)^3\).
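Building these polynomial features is a one-liner; note how their ranges immediately diverge by orders of magnitude, which is why feature scaling matters even more here (the sizes below are hypothetical):

```python
import numpy as np

# Hypothetical sizes; build x1 = size, x2 = size^2, x3 = size^3 as new features.
size = np.array([1000.0, 1500.0, 2000.0])
X_poly = np.column_stack([size, size**2, size**3])

# Columns now span ~10^3, ~10^6, and ~10^9 respectively, so apply
# feature scaling (e.g. mean normalization) before gradient descent.
```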
Normal equation
Overview
Method to solve for \(\theta\) analytically.
Unlike gradient descent, it computes the optimal \(\theta\) directly in one step.
Intuition
If 1D (\(\theta \in R\)):
set \(\frac{d}{d\theta}J(\theta) = 0\) and solve for \(\theta\).
If \(\theta \in R^{n+1}\), \(J(\theta_0,\theta_1,\ldots,\theta_n) = \frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2\):
set \(\frac{\partial}{\partial\theta_j}J(\theta) = 0\) (for every \(j\)) and solve for \(\theta_0, \theta_1,\ldots,\theta_n\).
In vector form this can be computed directly (proof omitted): \(X\theta = y\) \(\rightarrow\) \(X^TX\theta = X^Ty\) \(\rightarrow\) \(\theta = (X^TX)^{-1}X^Ty\).
`pinv(X'*X)*X'*y` (Octave/MATLAB)
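The same one-liner translates directly to NumPy. A sketch on toy data where \(y = 1 + 2x_1\):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, computed with the pseudoinverse
    (mirroring the Octave one-liner pinv(X'*X)*X'*y)."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Toy data following y = 1 + 2*x1, with the x_0 = 1 column prepended.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = normal_equation(X, y)
```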
Advantages and disadvantages
Given \(m\) training examples and \(n\) features:
- Gradient Descent
  - Need to choose \(\alpha\).
  - Needs many iterations.
  - Works well even when \(n\) is large.
- Normal Equation
  - No need to choose \(\alpha\).
  - No need to iterate.
  - Need to compute \((X^TX)^{-1}\), which costs roughly \(O(n^3)\).
  - Slow if \(n\) is very large.
Normal equation and non-invertibility
What if \(X^TX\) is non-invertible (singular/degenerate)?
Redundant features (linearly dependent): delete the redundant ones.
E.g. \(x_1\) = size in \(feet^2\), \(x_2\) = size in \(m^2\)
Too many features (e.g. \(m \leq n\)):
delete some features, or use regularization.
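A redundant feature makes \(X^TX\) singular, which is easy to check numerically; the pseudoinverse still returns a usable (minimum-norm) solution. A sketch with made-up data where \(x_2 = 2x_1\) (say, the same size recorded in two units):

```python
import numpy as np

# Redundant features: the third column is exactly 2 * the second,
# so X^T X is rank-deficient and has no ordinary inverse.
X = np.array([[1.0, 1.0, 2.0], [1.0, 2.0, 4.0], [1.0, 3.0, 6.0]])
y = np.array([3.0, 5.0, 7.0])

rank = np.linalg.matrix_rank(X.T @ X)        # 2 < 3 --> singular
theta = np.linalg.pinv(X.T @ X) @ X.T @ y    # pinv still yields a valid fit
```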