Linear Regression with Multiple Variables
Multiple features
Notation
- \(n\) = number of features.
- \(x^{(i)}\) = input (features) of \(i^{th}\) training example.
- \(x_j^{(i)}\) = value of feature \(j\) in \(i^{th}\) training example.
Hypothesis
- Previously: \(h_\theta(x) = \theta_0 + \theta_1x\)
- Four features: \(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + \theta_4x_4\)
- Multiple features: \(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n\)
For convenience of notation, define \(x_0 = 1\) (so \(x_0^{(i)} = 1\) for every example).
Then \(h_\theta(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n = \theta^Tx\).
We call this multivariate linear regression.
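With \(x_0 = 1\) prepended, the hypothesis is a single dot product. A minimal NumPy sketch (the parameter and feature values below are made up for illustration):

```python
import numpy as np

# Hypothetical example: n = 4 features, with x_0 = 1 prepended.
theta = np.array([80.0, 0.1, 0.01, 3.0, -2.0])  # (n+1,) parameter vector
x = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])     # (n+1,) feature vector, x[0] = 1

h = theta @ x  # h_theta(x) = theta^T x
```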
Gradient descent for multiple variables
Hypothesis
\(h_\theta(x) = \theta^Tx = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n\)
Parameters
\(\theta_0, \theta_1, \ldots, \theta_n\) --> an \((n + 1)\)-dimensional vector \(\theta\)
Cost function
\(J(\theta) = J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m}\sum^m_{i = 1}(h_\theta(x^{(i)}) - y^{(i)})^2\) --> a scalar function of the \((n + 1)\)-dimensional vector \(\theta\)
Gradient descent
Repeat {
\(\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\ldots,\theta_n)\)
} (simultaneously update for every \(j = 0, \ldots, n\))
Previously (n = 1):
Repeat {
\(\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})\)
\(\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}\)
}
New algorithm (\(n \geq 1\)):
Repeat {
\(\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\)
} (simultaneously update \(\theta_j\) for \(j = 0, \ldots, n\))
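The update rule above vectorizes cleanly over all \(\theta_j\) at once. A minimal sketch (the toy data, \(\alpha\), and iteration count are illustrative choices):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix whose first column is all ones (x_0 = 1).
    y: (m,) targets. Returns the learned (n+1,) parameter vector.
    """
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        error = X @ theta - y               # h_theta(x^(i)) - y^(i), all i at once
        theta -= alpha * (X.T @ error) / m  # simultaneous update of every theta_j
    return theta

# Tiny noiseless check: the data follow y = 1 + 2*x1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = gradient_descent(X, y, alpha=0.1, iters=5000)
```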
Gradient descent in practice I: Feature Scaling
Feature Scaling
Idea: Make sure features are on a similar scale.
E.g. \(x_1\) = size (0–2000 \(\text{feet}^2\)), \(x_2\) = number of bedrooms (1–5)
---> \(x_1 = \frac{\text{size}(\text{feet}^2)}{2000}\), \(x_2 = \frac{\text{number of bedrooms}}{5}\)
Get every feature into approximately a \(-1 \leq x_i \leq 1\) range; ranges that are far larger or far smaller are not acceptable:
\(-100 \leq x_i \leq 100\) or \(-0.0001 \leq x_i \leq 0.0001\) (×)
Mean normalization
Replace \(x_i\) with \(x_i - \mu_i\) to make features have approximately zero mean (do not apply to \(x_0 = 1\)).
E.g. \(x_1 = \frac{\text{size} - 1000}{2000}\), \(x_2 = \frac{\text{bedrooms} - 2}{5}\) --> \(-0.5 \leq x_1 \leq 0.5\), \(-0.5 \leq x_2 \leq 0.5\).
More generally: \(x_1 := \frac{x_1 - \mu_1}{s_1}\),
where \(\mu_1\) is the average value of \(x_1\) in the training set, and \(s_1\) is the range of the feature (maximum minus minimum) or, alternatively, its standard deviation.
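Mean normalization can be sketched in a few lines (the housing numbers below are hypothetical):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature (column) via (x - mean) / range.

    Apply only to real features; the x_0 = 1 column is added afterwards.
    Returns the scaled matrix plus (mu, s) so new inputs can be scaled the same way.
    """
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)  # range; could also use X.std(axis=0)
    return (X - mu) / s, mu, s

# Hypothetical housing features: size (feet^2) and number of bedrooms.
X = np.array([[2104.0, 3.0], [1416.0, 2.0], [1534.0, 3.0], [852.0, 2.0]])
X_norm, mu, s = mean_normalize(X)
```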
Gradient descent in practice II: Learning rate
Making sure gradient descent is working correctly
\(J(\theta)\) should decrease after every iteration. Plot \(J(\theta)\) against the number of iterations; once the curve has flattened out, gradient descent has converged.
Note: the number of iterations gradient descent needs can vary widely from problem to problem.
Example automatic convergence test:
Declare convergence if \(J(\theta)\) decreases by less than some small threshold \(\varepsilon\) (e.g. \(10^{-3}\)) in one iteration.
Note: choosing an appropriate \(\varepsilon\) is usually difficult, so inspecting the plot of \(J(\theta)\) is generally preferred.
Choose learning rate \(\alpha\)
Summary
If \(\alpha\) is too small: slow convergence.
If \(\alpha\) is too large: \(J(\theta)\) may not decrease on every iteration; may not converge.
Choose \(\alpha\)
try …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
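The automatic convergence test and the \(\alpha\) grid can be combined into a simple loop. A sketch (the toy data, \(\varepsilon = 10^{-3}\), and iteration cap are illustrative):

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = (1/2m) * sum((X theta - y)^2)."""
    m = len(y)
    r = X @ theta - y
    return (r @ r) / (2 * m)

def run_until_converged(X, y, alpha, eps=1e-3, max_iters=10_000):
    """Gradient descent until J decreases by less than eps in one iteration."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    prev = cost(X, y, theta)
    for i in range(max_iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / m
        cur = cost(X, y, theta)
        if prev - cur < eps:  # automatic convergence test
            return theta, i + 1
        prev = cur
    return theta, max_iters

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
# Try values from the grid; a larger (but still stable) alpha needs fewer steps.
_, iters_small = run_until_converged(X, y, alpha=0.01)
_, iters_large = run_until_converged(X, y, alpha=0.1)
```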
Features and polynomial regression
Housing prices prediction
\(h_\theta(x) = \theta_0 + \theta_1 \times \text{frontage} + \theta_2 \times \text{depth}\)
--> Land area: \(x = \text{frontage} \times \text{depth}\) --> \(h_\theta(x) = \theta_0 + \theta_1x\)
Sometimes, viewing the problem from a different angle and defining a new feature, rather than using the original features directly, yields a better model.
Polynomial regression
For example, when a straight line does not fit the data well, choose a quadratic or cubic model:
\(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 = \theta_0 + \theta_1(size) + \theta_2(size)^2 + \theta_3(size)^3\)
where \(x_1 = (size), x_2 = (size)^2, x_3 = (size)^3\).
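Building these polynomial features is a one-liner; note how their ranges immediately diverge by orders of magnitude, which is why feature scaling matters even more here (the sizes below are hypothetical):

```python
import numpy as np

# Hypothetical sizes; build x1 = size, x2 = size^2, x3 = size^3 as new features.
size = np.array([1000.0, 1500.0, 2000.0])
X_poly = np.column_stack([size, size**2, size**3])

# Columns now span ~10^3, ~10^6, and ~10^9 respectively, so apply
# feature scaling (e.g. mean normalization) before gradient descent.
```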
Normal equation
Overview
Method to solve for \(\theta\) analytically.
Unlike gradient descent, it computes the optimal \(\theta\) directly in one step.
Intuition
If 1D (\(\theta \in R\)):
set \(\frac{d}{d\theta}J(\theta) = 0\) and solve for \(\theta\).
If \(\theta \in R^{n+1}\), \(J(\theta_0,\theta_1,\ldots,\theta_n) = \frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2\):
set \(\frac{\partial}{\partial\theta_j}J(\theta) = 0\) (for every \(j\)) and solve for \(\theta_0, \theta_1,\ldots,\theta_n\).
In vector form this can be computed directly (proof omitted): \(X\theta = y\) \(\rightarrow\) \(X^TX\theta = X^Ty\) \(\rightarrow\) \(\theta = (X^TX)^{-1}X^Ty\).
`pinv(X'*X)*X'*y` (Octave/MATLAB)
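The same one-liner translates directly to NumPy. A sketch on toy data where \(y = 1 + 2x_1\):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, computed with the pseudoinverse
    (mirroring the Octave one-liner pinv(X'*X)*X'*y)."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Toy data following y = 1 + 2*x1, with the x_0 = 1 column prepended.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = normal_equation(X, y)
```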
Advantages and disadvantages
Given \(m\) training examples and \(n\) features:
- Gradient Descent
  - Need to choose \(\alpha\).
  - Needs many iterations.
  - Works well even when \(n\) is large.
- Normal Equation
  - No need to choose \(\alpha\).
  - No need to iterate.
  - Need to compute \((X^TX)^{-1}\), which costs roughly \(O(n^3)\).
  - Slow if \(n\) is very large.
Normal equation and non-invertibility
What if \(X^TX\) is non-invertible (singular/degenerate)?
Redundant features (linearly dependent): delete the redundant ones.
E.g. \(x_1\) = size in \(feet^2\), \(x_2\) = size in \(m^2\)
Too many features (e.g. \(m \leq n\)):
delete some features, or use regularization.
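A redundant feature makes \(X^TX\) singular, which is easy to check numerically; the pseudoinverse still returns a usable (minimum-norm) solution. A sketch with made-up data where \(x_2 = 2x_1\) (say, the same size recorded in two units):

```python
import numpy as np

# Redundant features: the third column is exactly 2 * the second,
# so X^T X is rank-deficient and has no ordinary inverse.
X = np.array([[1.0, 1.0, 2.0], [1.0, 2.0, 4.0], [1.0, 3.0, 6.0]])
y = np.array([3.0, 5.0, 7.0])

rank = np.linalg.matrix_rank(X.T @ X)        # 2 < 3 --> singular
theta = np.linalg.pinv(X.T @ X) @ X.T @ y    # pinv still yields a valid fit
```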