Machine Learning - Andrew Ng - Study Notes (4)

Linear Regression with multiple variables

Multiple features

Notation

  1. \(n\) = number of features.
  2. \(x^{(i)}\) = input (features) of \(i^{th}\) training example.
  3. \(x_j^{(i)}\) = value of feature \(j\) in \(i^{th}\) training example.

Hypothesis

  1. Previously: \(h_\theta(x) = \theta_0 + \theta_1x\)
  2. Four features: \(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + \theta_4x_4\)
  3. Multiple features: \(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n\)
    For convenience of notation, define \(x_0 = 1\) (i.e. \(x_0^{(i)} = 1\)).
    Then \(h_\theta(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n = \theta^Tx\).
    This is called multivariate linear regression (see the Octave sketch below).
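
As a quick illustration, here is a minimal Octave sketch of computing this hypothesis; the variable names X, theta, h and the data values are assumptions for the example, not part of the notes. Stacking the training examples as rows of a design matrix whose first column is \(x_0 = 1\) lets \(\theta^Tx\) be evaluated for every example at once:

    % Hypothetical example: 2 training examples, 2 features, plus x0 = 1
    X = [1 2104 5;            % each row is one training example [x0 x1 x2]
         1 1416 3];
    theta = [0.5; 0.1; 20];   % (n+1)-dimensional parameter vector (made-up values)
    h = X * theta;            % h(i) = theta' * x^{(i)} for every training example i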

Gradient descent for multiple variables

Hypothesis

\(h_\theta(x) = \theta^Tx = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n\)

Parameters

\(\theta_0, \theta_1, \ldots, \theta_n\) --> an \((n+1)\)-dimensional vector \(\theta\)

Cost function

\(J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m}\sum^m_{i = 1}(h_\theta(x^{(i)}) - y^{(i)})^2\) --> a function of the \((n+1)\)-dimensional vector \(\theta\)
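
A minimal Octave sketch of this cost function, assuming X already contains the \(x_0 = 1\) column and y is the \(m \times 1\) vector of targets (compute_cost and these names are assumptions, not from the notes):

    function J = compute_cost(X, y, theta)
      % J(theta) = (1/(2m)) * sum of squared prediction errors
      m = length(y);
      J = (1 / (2 * m)) * sum((X * theta - y) .^ 2);
    end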

Gradient descent

  1. Repeat {

    \(\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\ldots,\theta_n)\) (simultaneously update for every \(j = 0, \ldots, n\))

    }

  2. Previously (n = 1):

    Repeat {

    \(\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})\)

    \(\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}\)

    }

  3. New algorithm (\(n \geq 1\)); a vectorized Octave sketch follows this list:

    Repeat {

    \(\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\)

    }
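
A vectorized form of this update might look like the sketch below; gradient_descent, alpha (the learning rate), and num_iters (the number of iterations) are assumed names, not from the notes:

    function theta = gradient_descent(X, y, theta, alpha, num_iters)
      m = length(y);
      for iter = 1:num_iters
        % X' * (X*theta - y) computes sum_i (h(x^{(i)}) - y^{(i)}) * x_j^{(i)}
        % for all j at once, so every theta_j is updated simultaneously.
        theta = theta - (alpha / m) * X' * (X * theta - y);
      end
    end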

Gradient descent in practice I: Feature Scaling

Feature Scaling

  1. Idea: Make sure features are on a similar scale.

  2. E.g. \(x_1\) = size (0-2000 \({feet}^2\)), \(x_2\) = number of bedrooms (1-5)

    --> \(x_1 = \frac{\text{size}({feet}^2)}{2000}\), \(x_2 = \frac{\text{number of bedrooms}}{5}\)

  3. Get every feature into approximately a \(-1 \leq x_i \leq 1\) range. Too small or too large is not acceptable.

    \(-100 \leq x_i \leq 100\) or \(-0.0001 \leq x_i \leq 0.0001\) (×)

Mean normalization

  1. Replace \(x_i\) with \(x_i - \mu_i\) to make features have approximately zero mean (do not apply to \(x_0 = 1\)).

  2. E.g. \(x_1 = \frac{\text{size} - 1000}{2000}\), \(x_2 = \frac{\text{bedrooms} - 2}{5}\). --> \(-0.5 \leq x_1 \leq 0.5\), \(-0.5 \leq x_2 \leq 0.5\).

  3. More generally: \(x_i := \frac{x_i - \mu_i}{s_i}\) (see the sketch below), where

    \(\mu_i\) is the average value of feature \(i\) over the training set, and \(s_i\) is the range of the feature (max minus min), or alternatively its standard deviation.
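
A small Octave sketch combining feature scaling and mean normalization column by column; feature_normalize and the returned names are assumptions, and the \(x_0 = 1\) column would be appended afterwards:

    function [X_norm, mu, s] = feature_normalize(X)
      mu = mean(X);                % per-feature mean (row vector)
      s  = max(X) - min(X);        % per-feature range (max - min); std(X) is another common choice
      X_norm = (X - mu) ./ s;      % broadcasting: each feature ends up roughly in [-0.5, 0.5]
    end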

Gradient descent in practice II: Learning rate

Making sure gradient descent is working correctly

  1. \(J(\theta)\) should decrease after every iteration. Plot \(J(\theta)\) against the number of iterations; when the curve flattens out, gradient descent has converged (see the sketch after this list).

    Note: the number of iterations gradient descent needs can vary greatly from one problem to another.

  2. Example automatic convergence test:

    Declare convergence if \(J(\theta)\) decreases by less than \(10^{-3}\) in one iteration, i.e. when the decrease falls below some small threshold \(\varepsilon\).

    Note: choosing an appropriate \(\varepsilon\) is usually difficult, so method 1 (watching the curve) is normally used instead.
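
One way to run this check in practice is to record \(J(\theta)\) at every iteration and plot it against the iteration number. A sketch under assumed names (X, y, theta, alpha, m, and num_iters are presumed to be defined already):

    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
      theta = theta - (alpha / m) * X' * (X * theta - y);
      J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);
    end
    plot(1:num_iters, J_history);    % the curve should decrease and flatten out
    xlabel('Number of iterations'); ylabel('J(\theta)');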

Choose learning rate \(\alpha\)

  1. Summary

    If \(\alpha\) is too small: slow convergence.

    If \(\alpha\) is too large: \(J(\theta)\) may not decrease on every iteration; may not converge.

  2. Choose \(\alpha\)

    Try \(\ldots, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, \ldots\) (roughly tripling each time), and pick the largest value for which \(J(\theta)\) still decreases on every iteration (see the sketch below).
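
A hedged sketch of trying these candidate values; the loop, the gradient_descent function from the earlier sketch, and the iteration count 400 are assumptions. Run a fixed number of iterations for each \(\alpha\) and compare the resulting costs:

    alphas = [0.001 0.003 0.01 0.03 0.1 0.3 1];
    for k = 1:length(alphas)
      theta = gradient_descent(X, y, zeros(size(X, 2), 1), alphas(k), 400);  % restart from theta = 0
      J = (1 / (2 * m)) * sum((X * theta - y) .^ 2);
      fprintf('alpha = %g, J after 400 iterations = %g\n', alphas(k), J);
    end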

Features and polynomial regression

Housing prices prediction

\(h_\theta(x) = \theta_0 + \theta_1 \times \text{frontage} + \theta_2 \times \text{depth}\)

--> Land area: \(x = \text{frontage} \times \text{depth}\) --> \(h_\theta(x) = \theta_0 + \theta_1x\)

Sometimes, looking at the problem from a different angle and defining a new feature, rather than using the original features directly, does yield a better model.

Polynomial regression

For example, when a straight line does not fit the data well, a quadratic or cubic model can be chosen.

\(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 = \theta_0 + \theta_1(size) + \theta_2(size)^2 + \theta_3(size)^3\),
where \(x_1 = (size), x_2 = (size)^2, x_3 = (size)^3\).
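
A sketch of building these polynomial features in Octave; size_col and the feature_normalize function from the earlier sketch are assumed names ('size' itself is a built-in, hence the rename). Feature scaling matters here because \((size)^3\) has a vastly larger range than \(size\):

    % size_col: m x 1 vector of house sizes
    X_poly = [size_col, size_col .^ 2, size_col .^ 3];
    [X_poly, mu, s] = feature_normalize(X_poly);     % scaling is essential for these features
    X_poly = [ones(length(size_col), 1), X_poly];    % add the x0 = 1 column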

Normal equation

Overview

Method to solve for \(\theta\) analytically.

Unlike gradient descent, this method solves for the optimal \(\theta\) directly in one step, without iteration.

Intuition

  1. If 1D (\(\theta \in R\)), i.e. \(\theta\) is a single real number:

    \(\frac{d}{d\theta}J(\theta) = 0\) \(\rightarrow\) \(\theta\)

  2. \(\theta \in R^{n+1}\), \(J(\theta_0,\theta_1,\ldots,\theta_n) = \frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2\)

    \(\frac{\partial}{\partial\theta_j}J(\theta) = 0\) (for every \(j\)) \(\rightarrow\) \(\theta_0, \theta_1,\ldots,\theta_n\)

  3. After writing the problem in matrix form, \(\theta\) can be computed directly from the following formula (derivation omitted; a usage sketch follows below): \(X\theta = y\), \(X^TX\theta = X^Ty\) \(\rightarrow\) \(\theta = (X^TX)^{-1}X^Ty\).

    pinv(X'*X)*X'*y
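
In context, that one-liner might be used as in the sketch below (the data values are made up). pinv is used rather than inv so the computation still behaves reasonably when \(X^TX\) is non-invertible, which connects to the last subsection; note also that feature scaling is not needed with the normal equation:

    X = [1 2104 5;                   % design matrix with the x0 = 1 column
         1 1416 3;
         1 1534 3];
    y = [460; 232; 315];             % target values (made-up numbers)
    theta = pinv(X' * X) * X' * y;   % closed-form solution in one step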

Advantages and disadvantages

\(m\) training examples, \(n\) features.

  1. Gradient Descent
    • Need to choose \(\alpha\).
    • Needs many iterations.
    • Works well even when \(n\) is large.
  2. Normal Equation
    • No need to choose \(\alpha\).
    • No need to iterate.
    • Needs to compute \((X^TX)^{-1}\), which costs roughly \(O(n^3)\).
    • Slow if \(n\) is very large.

Normal equation and non-invertibility

What if \(X^TX\) is non-invertible (singular/degenerate)?

  1. Redundant features (linearly dependent): delete the redundant feature(s).

    E.g. \(x_1\) = size in \(feet^2\), \(x_2\) = size in \(m^2\)

  2. Too many features (e.g. \(m \leq n\)).

    Delete some features, or use regularization.

Reposted from www.cnblogs.com/songjy11611/p/12191297.html