[Machine Learning] 1 Linear Regression

1 Linear Regression with One Variable / Univariate Linear Regression

1.1 model

  • Hypothesis: $h_\theta(x)=\theta_0+\theta_1x$
  • Parameters: $\theta_0,\theta_1$
  • Cost Function (squared error cost function): $J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$
  • Goal (objective function): $\mathop{\text{minimize}}\limits_{\theta_0,\theta_1} J(\theta_0,\theta_1)$
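A minimal sketch (not from the original notes) of computing the hypothesis and the squared error cost directly from these definitions; the arrays x, y and the parameter values are illustrative:

    import numpy as np

    def hypothesis(theta0, theta1, x):
        # h_theta(x) = theta0 + theta1 * x, applied element-wise
        return theta0 + theta1 * x

    def cost(theta0, theta1, x, y):
        # J(theta0, theta1) = (1 / 2m) * sum_i (h_theta(x_i) - y_i)^2
        m = len(y)
        errors = hypothesis(theta0, theta1, x) - y
        return (errors ** 2).sum() / (2 * m)

    x = np.array([1.0, 2.0, 3.0])   # toy training inputs
    y = np.array([3.0, 5.0, 7.0])   # toy targets, exactly y = 1 + 2x
    print(cost(1.0, 2.0, x, y))     # 0.0 at the true parameters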

1.2 ‘Batch’ Gradient Descent Algorithm

to solve $\mathop{\text{minimize}}\limits_{\theta_0,\theta_1} J(\theta_0,\theta_1)$

1.2.1 algorithm

  • repeat until convergence {
    $\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$ (for $j=0$ and $j=1$)
    $\alpha$ —— learning rate
    }
  • Correct (simultaneous update):
    $\begin{aligned} temp_0&:=\theta_0-\alpha\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)\\ temp_1&:=\theta_1-\alpha\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)\\ \theta_0&:=temp_0\\ \theta_1&:=temp_1 \end{aligned}$
  • Notice: $\theta_0$ and $\theta_1$ need to be updated simultaneously

1.2.2 applied to univariate linear regression

  • repeat {
    $\begin{aligned}\theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\\ \theta_1&:=\theta_1-\alpha\frac{1}{m}\sum_{i=1}^m\left((h_\theta(x^{(i)})-y^{(i)})\cdot x^{(i)}\right) \end{aligned}$
    }

1.2.3 characteristics

  • every step of gradient descent uses all $m$ training examples (hence "batch")
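A minimal sketch of this batch update rule, assuming NumPy arrays x, y and illustrative values for the learning rate and iteration count (neither is prescribed above):

    import numpy as np

    def gradient_descent_univariate(x, y, alpha=0.05, num_iters=5000):
        # batch gradient descent for h(x) = theta0 + theta1 * x
        m = len(y)
        theta0, theta1 = 0.0, 0.0
        for _ in range(num_iters):
            errors = theta0 + theta1 * x - y            # h_theta(x^(i)) - y^(i), over all m examples
            temp0 = theta0 - alpha * errors.sum() / m
            temp1 = theta1 - alpha * (errors * x).sum() / m
            theta0, theta1 = temp0, temp1               # simultaneous update
        return theta0, theta1

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([3.1, 4.9, 7.2, 8.8])                  # roughly y = 1 + 2x
    print(gradient_descent_univariate(x, y))            # approximately (1.15, 1.94) for this data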

2 Linear Regression with Multiple Variables / Multivariate Linear Regression

2.1 model

  • Hypothesis: with $x_0=1$, $h_\theta(x)=\theta^Tx=\theta_0x_0+\theta_1x_1+\cdots+\theta_nx_n$
  • Parameters: the $(n+1)$-dimensional vector $\theta=(\theta_0,\theta_1,\cdots,\theta_n)$
  • Cost Function (squared error cost function): $J(\theta)=J(\theta_0,\theta_1,\cdots,\theta_n)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$
  • Goal (objective function): $\mathop{\text{minimize}}\limits_{\theta_0,\theta_1,\cdots,\theta_n} J(\theta_0,\theta_1,\cdots,\theta_n)$
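A small sketch of the vectorized hypothesis, assuming a single example stored as a NumPy vector with the extra $x_0=1$ entry prepended (the numbers are illustrative):

    import numpy as np

    theta = np.array([1.0, 2.0, 3.0])   # theta_0, theta_1, theta_2
    x_raw = np.array([4.0, 5.0])        # original features x_1, x_2
    x = np.concatenate(([1.0], x_raw))  # prepend x_0 = 1
    h = theta @ x                       # h_theta(x) = theta^T x
    print(h)                            # 1*1 + 2*4 + 3*5 = 24.0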

2.2 Gradient Descent Algorithm

2.2.1 algorithm

  • repeat until convergence {
    $\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1,\cdots,\theta_n)$ (simultaneously update $\theta_j$ for $j=0,1,\cdots,n$)
    }

2.2.2 applied to multivariate linear regression

  • repeat (with $x_0^{(i)}=1$) {
    $\begin{aligned}\theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m\left((h_\theta(x^{(i)})-y^{(i)})\cdot x_0^{(i)}\right)\\ \theta_1&:=\theta_1-\alpha\frac{1}{m}\sum_{i=1}^m\left((h_\theta(x^{(i)})-y^{(i)})\cdot x_1^{(i)}\right)\\ &\;\;\vdots\\ \theta_n&:=\theta_n-\alpha\frac{1}{m}\sum_{i=1}^m\left((h_\theta(x^{(i)})-y^{(i)})\cdot x_n^{(i)}\right) \end{aligned}$
    }
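A minimal vectorized sketch of these updates, assuming a design matrix X whose first column is all ones; alpha and num_iters are illustrative choices:

    import numpy as np

    def gradient_descent(X, y, alpha=0.05, num_iters=20000):
        # batch gradient descent for multivariate linear regression
        # X: (m, n+1) design matrix with X[:, 0] == 1, y: (m,) targets
        m, n_plus_1 = X.shape
        theta = np.zeros(n_plus_1)
        for _ in range(num_iters):
            errors = X @ theta - y                      # h_theta(x^(i)) - y^(i), shape (m,)
            theta = theta - alpha * (X.T @ errors) / m  # all theta_j updated simultaneously
        return theta

    # toy data generated from y = 1 + 2*x1 + 3*x2
    X = np.array([[1.0, 1.0, 2.0],
                  [1.0, 2.0, 1.0],
                  [1.0, 3.0, 3.0],
                  [1.0, 4.0, 2.0]])
    y = np.array([9.0, 8.0, 16.0, 15.0])
    print(gradient_descent(X, y))                       # approximately [1., 2., 3.]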

3 Gradient Descent in Practice

3.1 Feature Scaling

  • Goal: make sure features are on a similar scale
  • Advantages:
    (1) makes gradient descent run much faster
    (2) converges in far fewer iterations
  • Methods:
    (1) dividing each feature by its maximum value
    (2) mean normalization

3.1.1 mean normalization

  • Theory: replace $x_i$ with $x_i-\mu_i$ so that features have approximately zero mean
    (do not apply this to $x_0=1$)
    $x_i:=\frac{x_i-\mu_i}{s_i}$
  • $\mu_i$ —— average value of feature $x_i$ over the training set
  • $s_i$ —— the range of values of that feature, either
    (1) $\text{maximum value} - \text{minimum value}$, or
    (2) the standard deviation of the feature
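A minimal sketch of mean normalization applied to the feature columns (the $x_0=1$ column is excluded); dividing by the range is used here, but the standard deviation would work the same way:

    import numpy as np

    def mean_normalize(features):
        # features: (m, n) matrix of raw features, WITHOUT the x_0 = 1 column
        mu = features.mean(axis=0)                        # mean of each feature
        s = features.max(axis=0) - features.min(axis=0)   # range of each feature
        return (features - mu) / s, mu, s                 # keep mu, s to scale new inputs later

    raw = np.array([[2104.0, 5.0],
                    [1416.0, 3.0],
                    [ 852.0, 2.0]])
    scaled, mu, s = mean_normalize(raw)
    print(scaled)   # each column now has zero mean and a similar scale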

3.2 Learning Rate

  • Debugging: how to make sure gradient descent is working correctly

3.2.1 good method: plot

  • plot the value of the cost function $J(\theta)$ against the number of iterations
  • Advantages:
    (1) shows whether gradient descent is working correctly —— $J(\theta)$ should decrease after every iteration
    (2) helps judge whether or not gradient descent has converged
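A minimal sketch of this diagnostic: record $J(\theta)$ at every iteration and plot it with matplotlib (the data, learning rate, and iteration count are illustrative):

    import numpy as np
    import matplotlib.pyplot as plt

    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])    # design matrix with x_0 = 1
    y = np.array([3.0, 5.0, 7.0])
    m = len(y)
    theta, alpha, num_iters = np.zeros(2), 0.1, 200

    J_history = []
    for _ in range(num_iters):
        errors = X @ theta - y
        J_history.append((errors ** 2).sum() / (2 * m))   # cost before this update
        theta -= alpha * (X.T @ errors) / m

    plt.plot(J_history)
    plt.xlabel("number of iterations")
    plt.ylabel("J(theta)")
    plt.show()   # the curve should decrease on every iteration and flatten out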

3.2.2 other method

  • declare convergence if $J(\theta)$ decreases by less than some threshold $\varepsilon$ in one iteration
  • Advantage: convergence is judged automatically
  • Disadvantage: choosing $\varepsilon$ can be difficult
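A minimal sketch of this automatic test inside a gradient descent loop; the threshold 1e-6 is only an illustrative choice:

    import numpy as np

    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([3.0, 5.0, 7.0])
    m = len(y)
    theta, alpha, epsilon = np.zeros(2), 0.1, 1e-6

    J_prev = ((X @ theta - y) ** 2).sum() / (2 * m)
    for i in range(10000):
        theta -= alpha * (X.T @ (X @ theta - y)) / m
        J = ((X @ theta - y) ** 2).sum() / (2 * m)
        if J_prev - J < epsilon:                   # J barely decreased: declare convergence
            print(f"converged after {i + 1} iterations")
            break
        J_prev = J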

3.2.3 choosing $\alpha$

  • consider $\alpha=0.01,0.03,0.1,0.3,1,3,10,\cdots$
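A minimal sketch of this search: run a fixed, small number of iterations for each candidate $\alpha$ and compare how far $J(\theta)$ has dropped (data and iteration count are illustrative):

    import numpy as np

    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([3.0, 5.0, 7.0])
    m = len(y)

    for alpha in [0.01, 0.03, 0.1, 0.3]:
        theta = np.zeros(2)
        for _ in range(100):
            theta -= alpha * (X.T @ (X @ theta - y)) / m
        J = ((X @ theta - y) ** 2).sum() / (2 * m)
        print(f"alpha = {alpha:<4}  J after 100 iterations = {J:.6f}")
    # pick the largest alpha for which J still decreases steadily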

3.2.4 plot problem

  • Problem: $J(\theta)$ does not decrease after every iteration
  • Cause: $\alpha$ is too large, so $J(\theta)$ may not decrease on every iteration; gradient descent may converge slowly or not converge at all
  • Fix: choose a sufficiently smaller $\alpha$
  • Trade-off: if $\alpha$ is too small, gradient descent can be very slow to converge

3.3 Features and Polynomial Regression

  • linear regression does not fit every data set well
  • polynomial regression can be converted into linear regression by defining new features
  • inspect the training data in order to choose a suitable model (i.e. a suitable set of features)
  • Notice: feature scaling becomes necessary when using polynomial regression, because the powers of $x$ have very different ranges
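A minimal sketch of the idea: to fit, say, a cubic model, construct $x$, $x^2$, $x^3$ as new features, scale them, and fit an ordinary linear regression on the result (the data and the normal-equation solve at the end are illustrative):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 9.0, 28.0, 65.0, 126.0])               # exactly y = 1 + x^3

    # new features x_1 = x, x_2 = x^2, x_3 = x^3 turn the cubic into a linear model
    features = np.column_stack([x, x ** 2, x ** 3])

    # feature scaling matters here: x^3 spans a far larger range than x
    mu, s = features.mean(axis=0), features.std(axis=0)
    features_scaled = (features - mu) / s

    X = np.column_stack([np.ones(len(x)), features_scaled])   # add x_0 = 1
    theta = np.linalg.pinv(X.T @ X) @ X.T @ y                  # fit the linear model
    print(theta)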

4 Normal Equation

a method to solve for $\theta$ analytically, i.e. in a single step rather than by iterating to the optimal value

4.1 model

  • $m$ examples $(x^{(1)},y^{(1)}),\cdots,(x^{(m)},y^{(m)})$; $n$ features
  • each $x^{(i)}$ is an $(n+1)$-dimensional vector: $x^{(i)}=\begin{bmatrix} x_0^{(i)}\\ \vdots\\ x_n^{(i)} \end{bmatrix}$
  • design matrix (an $m\times(n+1)$ matrix): $X=\begin{bmatrix} (x^{(1)})^T\\ \vdots\\ (x^{(m)})^T \end{bmatrix}$
  • $y=\begin{bmatrix} y^{(1)}\\ \vdots\\ y^{(m)} \end{bmatrix}$
  • $\theta=(X^TX)^{-1}X^Ty$

4.2 implementation

  • Octave

    pinv(X'*X)*X'*y

  • Python

    import numpy as np

    def normalEqn(X, y):
        theta = np.linalg.inv(X.T @ X) @ X.T @ y  # X.T @ X is equivalent to X.T.dot(X)
        return theta
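For example, with a toy design matrix whose column of ones is already included (illustrative data, using the normalEqn defined above):

    X = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])
    y = np.array([3.0, 5.0, 7.0])
    print(normalEqn(X, y))   # approximately [1., 2.], i.e. h(x) = 1 + 2x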

Problem: what if $X^TX$ is non-invertible (singular / degenerate)?

  • Causes:
    (1) redundant features (linearly dependent)
    (2) too many features —— delete some features or use regularization
  • a non-invertible $X^TX$ rarely occurs in practice
  • pseudo-inverse: pinv()
  • inverse: inv()
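A minimal sketch of why the pseudo-inverse is the safer call (NumPy's equivalents are np.linalg.inv and np.linalg.pinv): with a redundant, linearly dependent feature, inverting $X^TX$ fails, while the pseudo-inverse still returns a usable $\theta$:

    import numpy as np

    # the second and third columns are identical, so X^T X is singular
    X = np.array([[1.0, 2.0, 2.0],
                  [1.0, 3.0, 3.0],
                  [1.0, 4.0, 4.0]])
    y = np.array([5.0, 7.0, 9.0])

    # np.linalg.inv(X.T @ X) would raise LinAlgError here; pinv does not
    theta = np.linalg.pinv(X.T @ X) @ X.T @ y
    print(theta)   # a valid least-squares solution despite the redundant feature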

5 Comparison of Gradient Descent and the Normal Equation

| Gradient Descent | Normal Equation |
| --- | --- |
| need to choose $\alpha$ | no need to choose $\alpha$ |
| needs many iterations | no iteration needed |
| works well even when $n$ is large | slow if $n$ is very large, since $(X^TX)^{-1}$ must be computed (how large $n$ has to be is hard to pin down) |
| applies to many types of models | only applies to linear regression; not suitable for, e.g., the logistic regression algorithm |

6 References

Andrew Ng (吴恩达), Machine Learning, Coursera
Huang Haiguang (黄海广), Machine Learning notes


Reposted from blog.csdn.net/qq_44714521/article/details/107589453