Chapter 1 Linear Regression
1 Linear Regression with One Variable / Univariate Linear Regression
1.1 model
- Hypothesis: $h_\theta(x)=\theta_0+\theta_1x$
- Parameters: $\theta_0,\theta_1$
- Cost Function:
squared error function / squared error cost function (see the sketch below)
$$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
- Goal (Objective Function): $\mathop{\text{minimize}}\limits_{\theta_0,\theta_1} J(\theta_0,\theta_1)$
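To make the definition concrete, a minimal NumPy sketch of the squared error cost (the function name and toy data are illustrative, not from the course):

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared error cost J(theta0, theta1) for univariate linear regression."""
    m = len(y)                          # number of training examples
    predictions = theta0 + theta1 * x   # h_theta(x^(i)) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Tiny usage check: the line y = 2x fits this data exactly, so the cost is 0.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(0.0, 2.0, x, y))     # 0.0
```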
1.2 ‘Batch’ Gradient Descent Algorithm
to solve $\mathop{\text{minimize}}\limits_{\theta_0,\theta_1} J(\theta_0,\theta_1)$
1.2.1 algorithm
- repeat until convergence{
  $$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)\quad\text{(for $j=0$ and $j=1$)}$$
  $\alpha$: learning rate
  }
- Correct (simultaneous update):
  $$\begin{aligned} \text{temp}_0&:=\theta_0-\alpha\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)\\ \text{temp}_1&:=\theta_1-\alpha\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)\\ \theta_0&:=\text{temp}_0\\ \theta_1&:=\text{temp}_1 \end{aligned}$$
- Notice: $\theta_0$ and $\theta_1$ need to be updated simultaneously
1.2.2 applied to univariate linear regression
- repeat{
  $$\begin{aligned}\theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)\\ \theta_1&:=\theta_1-\alpha\frac{1}{m}\sum_{i=1}^m\left(\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x^{(i)}\right) \end{aligned}$$
}
1.2.3 why it is called "batch"
- every step of gradient descent uses all $m$ training examples (a full implementation sketch follows below)
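Putting 1.2.2 and 1.2.3 together, a minimal batch gradient descent sketch for the univariate case (function name, hyperparameter values, and toy data are mine, not from the course):

```python
import numpy as np

def gradient_descent_uni(x, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        error = (theta0 + theta1 * x) - y                 # uses all m examples each step
        temp0 = theta0 - alpha * np.sum(error) / m        # simultaneous update:
        temp1 = theta1 - alpha * np.sum(error * x) / m    # compute both temps first,
        theta0, theta1 = temp0, temp1                     # then assign
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])            # exactly y = 1 + 2x
print(gradient_descent_uni(x, y, alpha=0.05, num_iters=5000))   # ~ (1.0, 2.0)
```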
2 Linear Regression with Multiple Variables / Multivariate Linear Regression
2.1 model
- Hypothesis:
  $x_0=1$
  $$h_\theta(x)=\theta^Tx=\theta_0x_0+\theta_1x_1+\cdots+\theta_nx_n$$
- Parameters:
  ($n+1$)-dimensional vector
  $$\theta=(\theta_0,\theta_1,\cdots,\theta_n)$$
- Cost Function:
  squared error function / squared error cost function
  $$J(\theta)=J(\theta_0,\theta_1,\cdots,\theta_n)=\frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
- Goal (Objective Function): $\mathop{\text{minimize}}\limits_{\theta_0,\theta_1,\cdots,\theta_n} J(\theta_0,\theta_1,\cdots,\theta_n)$
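With $x_0=1$, the hypothesis $h_\theta(x)=\theta^Tx$ can be evaluated for every training example at once with a single matrix product. A minimal sketch (variable names and numbers are mine):

```python
import numpy as np

# 3 examples, 2 real features; the first column of ones is x_0 = 1.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])
theta = np.array([0.5, 1.0, 2.0])    # theta_0, theta_1, theta_2

predictions = X @ theta              # h_theta(x^(i)) = theta^T x^(i) for every i
print(predictions)                   # [ 8.5 14.5 20.5]
```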
2.2 Gradient Descent Algorithm
2.2.1 algorithm
- repeat until convergence{
  $$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1,\cdots,\theta_n)\quad\text{(simultaneously update $\theta_j$ for $j=0,1,\cdots,n$)}$$
}
2.2.2 applied to multivariate linear regression
- repeat{
  $x_0^{(i)}=1$
  $$\begin{aligned}\theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m\left(\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x_0^{(i)}\right)\\ \theta_1&:=\theta_1-\alpha\frac{1}{m}\sum_{i=1}^m\left(\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x_1^{(i)}\right)\\ &\ \vdots\\ \theta_n&:=\theta_n-\alpha\frac{1}{m}\sum_{i=1}^m\left(\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x_n^{(i)}\right) \end{aligned}$$
}
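These $n+1$ updates can be collapsed into one vectorized step, $\theta:=\theta-\frac{\alpha}{m}X^T(X\theta-y)$. A minimal sketch under that formulation (the function name and demo data are mine, not from the course):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Vectorized batch gradient descent; X already contains the x_0 = 1 column."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        error = X @ theta - y                          # h_theta(x^(i)) - y^(i), all i at once
        theta = theta - (alpha / m) * (X.T @ error)    # simultaneous update of every theta_j
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])                          # y = 0 + 2 * x_1
print(gradient_descent(X, y, alpha=0.1, num_iters=2000))   # ~ [0.0, 2.0]
```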
3 Gradient Descent in Practice
3.1 Feature Scaling
- Goal: make sure features are on a similar scale
- Advantages:
(1) makes gradient descent run much faster
(2) converges in far fewer iterations
- Methods:
(1) divide each feature by its maximum value
(2) mean normalization
3.1.1 mean normalization
- Theory: replace $x_i$ with $x_i-\mu_i$ so that features have approximately zero mean
  (do not apply this to $x_0=1$)
  $$x_i:=\frac{x_i-\mu_i}{s_i}$$
- $\mu_i$: the average value of $x_i$ over the training set
- $s_i$: the range of values of that feature, either
  (1) $\text{maximum value} - \text{minimum value}$, or
  (2) the standard deviation of the feature
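A minimal mean-normalization sketch using the $\text{max}-\text{min}$ range as $s_i$; column 0 is assumed to be the all-ones $x_0$ column and is left untouched (the function name and data are mine):

```python
import numpy as np

def mean_normalize(X):
    """Scale every feature column except x_0 to roughly [-0.5, 0.5] with zero mean."""
    X = X.astype(float)
    mu = X[:, 1:].mean(axis=0)                        # per-feature mean
    s = X[:, 1:].max(axis=0) - X[:, 1:].min(axis=0)   # per-feature range (max - min)
    X[:, 1:] = (X[:, 1:] - mu) / s
    return X, mu, s

X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0,  852.0, 1.0]])
X_scaled, mu, s = mean_normalize(X)
print(X_scaled)    # first column stays 1; the other columns now have zero mean
```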
3.2 Learning Rate
- Debugging: how to make sure gradient descent is working correctly
3.2.1 good method: plot
- plot the value of the cost function $J(\theta)$ against the number of iterations
- Advantages:
(1) shows whether gradient descent is working correctly: $J(\theta)$ should decrease after every iteration
(2) helps judge whether gradient descent has converged
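A sketch of this diagnostic: record $J(\theta)$ at every iteration and plot it (matplotlib is assumed to be available; the helper names and data are mine):

```python
import numpy as np
import matplotlib.pyplot as plt

def gradient_descent_with_history(X, y, alpha, num_iters):
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        error = X @ theta - y
        theta -= (alpha / m) * (X.T @ error)
        history.append(np.sum(error ** 2) / (2 * m))   # J(theta) at this iteration
    return theta, history

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
_, history = gradient_descent_with_history(X, y, alpha=0.1, num_iters=200)
plt.plot(history)                   # should decrease on every iteration if alpha is OK
plt.xlabel("iteration")
plt.ylabel("J(theta)")
plt.show()
```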
3.2.2 other method: automatic convergence test
- declare convergence if $J(\theta)$ decreases by less than some small threshold $\varepsilon$ in one iteration
- Advantage: convergence can be detected automatically
- Disadvantage: choosing a suitable $\varepsilon$ can be difficult
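A sketch of such an automatic convergence test, stopping once the per-iteration decrease in $J(\theta)$ falls below a threshold (the value of `epsilon` and all names here are only examples):

```python
import numpy as np

def run_until_converged(X, y, alpha=0.1, epsilon=1e-3, max_iters=10000):
    m = len(y)
    theta = np.zeros(X.shape[1])
    prev_cost = np.inf
    for i in range(max_iters):
        error = X @ theta - y
        theta -= (alpha / m) * (X.T @ error)
        cost = np.sum((X @ theta - y) ** 2) / (2 * m)
        if prev_cost - cost < epsilon:        # J decreased by less than epsilon
            return theta, i
        prev_cost = cost
    return theta, max_iters

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta, iters_used = run_until_converged(X, y)
print(theta, iters_used)
```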
3.2.3 choosing $\alpha$
- consider values such as $\alpha=0.01, 0.03, 0.1, 0.3, 1, 3, 10, \cdots$ (each roughly $3\times$ the previous one)
3.2.4 problems visible in the plot
- Problem: $J(\theta)$ does not decrease after every iteration
- Cause: $\alpha$ is too large, so $J(\theta)$ may fail to decrease on every iteration; gradient descent may not converge at all (slow convergence is also possible)
- Fix: choose a sufficiently smaller $\alpha$
- Trade-off: if $\alpha$ is too small, gradient descent can be very slow to converge
3.3 Features and Polynomial Regression
- linear regression with a straight line does not fit every dataset
- polynomial regression can be converted into linear regression by treating powers of a feature as new features (e.g. $x_1=x,\ x_2=x^2,\ x_3=x^3$)
- inspect the training data first in order to choose a suitable model
- Notice: feature scaling matters when using polynomial regression, since $x$, $x^2$ and $x^3$ have very different ranges (a sketch follows below)
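A minimal sketch of building polynomial features by hand and then mean-normalizing them (a library such as scikit-learn could do the same; all names and numbers here are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # a single raw feature

# Build the design matrix [1, x, x^2, x^3]; the powers become ordinary "linear" features.
X_poly = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])

# Mean-normalize the non-constant columns so that their ranges are comparable.
mu = X_poly[:, 1:].mean(axis=0)
s = X_poly[:, 1:].max(axis=0) - X_poly[:, 1:].min(axis=0)
X_poly[:, 1:] = (X_poly[:, 1:] - mu) / s
print(X_poly)
```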
4 Normal Equation
a method to solve for $\theta$ analytically: a single step gives the optimal value directly
4.1 model
- $m$ examples $(x^{(1)},y^{(1)}),\cdots,(x^{(m)},y^{(m)})$; $n$ features
- ($n+1$)-dimensional vector: $x^{(i)}=\begin{bmatrix} x_0^{(i)}\\ \vdots\\ x_n^{(i)} \end{bmatrix}$
- design matrix ($m\times(n+1)$ matrix): $X=\begin{bmatrix} (x^{(1)})^T\\ \vdots\\ (x^{(m)})^T \end{bmatrix}$
- $y=\begin{bmatrix} y^{(1)}\\ \vdots\\ y^{(m)} \end{bmatrix}$
- $\theta={(X^TX)}^{-1}X^Ty$
4.2 implementation
- Octave
```octave
pinv(X'*X)*X'*y
```
- Python
```python
import numpy as np

def normalEqn(X, y):
    # X.T @ X is equivalent to X.T.dot(X)
    # np.linalg.pinv (pseudo-inverse) would mirror Octave's pinv and also handle a singular X^T X
    theta = np.linalg.inv(X.T @ X) @ X.T @ y
    return theta
```
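A quick usage check with `normalEqn` as defined above (the toy data is illustrative):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # first column is x_0 = 1
y = np.array([2.0, 4.0, 6.0])                         # generated from y = 0 + 2 * x_1
print(normalEqn(X, y))                                # ~ [0.0, 2.0]
```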
- Problem: what if $X^TX$ is non-invertible (singular / degenerate)?
- Causes:
  (1) redundant features (linearly dependent features)
  (2) too many features: delete some features or use regularization
- In practice a non-invertible $X^TX$ rarely occurs.
- pseudo-inverse: pinv() (still yields a usable $\theta$ even when $X^TX$ is singular)
- inverse: inv()
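A small NumPy illustration of the difference (NumPy's `pinv`/`inv` stand in for Octave's here): the second and third feature columns below are linearly dependent, so $X^TX$ is singular.

```python
import numpy as np

# x_2 is just 2 * x_1, so the columns are linearly dependent and X^T X is singular.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0]])
y = np.array([2.0, 4.0, 6.0])

theta = np.linalg.pinv(X.T @ X) @ X.T @ y   # the pseudo-inverse still returns a solution
print(theta)
# np.linalg.inv(X.T @ X) would typically fail here with LinAlgError: Singular matrix.
```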
5 Gradient Descent vs. Normal Equation
| Gradient Descent | Normal Equation |
| --- | --- |
| need to choose $\alpha$ | no need to choose $\alpha$ |
| needs many iterations | no iterations needed |
| works well even when $n$ is large | slow if $n$ is very large, since computing $(X^TX)^{-1}$ is expensive (roughly $O(n^3)$) |
| applies to many kinds of models | only applies to linear regression; not suitable for, e.g., the logistic regression algorithm |
6 References
Andrew Ng (吴恩达), Machine Learning, Coursera
Huang Haiguang (黄海广), machine learning course notes (机器学习笔记)