Machine Learning - Andrew Ng Study Notes (4)

Linear Regression with Multiple Variables (Multivariate Linear Regression)

Multiple features

Notation

  1. \(n\) = number of features
  2. \(x^{(i)}\) = input (features) of the \(i\)-th training example
  3. \(x_j^{(i)}\) = value of feature \(j\) in the \(i\)-th training example

Hypothesis

  1. Previously: \(h_\theta(x) = \theta_0 + \theta_1x\)
  2. Four features: \(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + \theta_4x_4\)
  3. Multiple features: \(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n\)
    For convenience of notation, define \(x_0 = 1\) (i.e. \(x_0^{(i)} = 1\)),
    then \(h_\theta(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n = \theta^Tx\).
    We call this multivariate linear regression (a small Octave illustration follows this list).
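
A tiny Octave illustration of the vectorized hypothesis \(h_\theta(x) = \theta^Tx\). The values of theta and x are made up (x_0 = 1, size in square feet, number of bedrooms):

    theta = [80; 0.1; 50];    % hypothetical theta_0, theta_1, theta_2
    x     = [1; 2104; 3];     % x_0 = 1, size in feet^2, number of bedrooms
    h     = theta' * x        % h_theta(x) = theta^T x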

Gradient descent for multiple variables

Hypothesis

\(h_\theta(x) = \theta^Tx = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n\)

Parameters

\(\theta_0, \theta_1, \ldots, \theta_n\) --> an \((n + 1)\)-dimensional vector \(\theta\)

Cost function

\(J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m}\sum^m_{i = 1}(h_\theta(x^{(i)}) - y^{(i)})^2\) --> a scalar-valued function of the \((n + 1)\)-dimensional vector \(\theta\)
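
A minimal Octave sketch of this cost function, assuming X is an m x (n+1) design matrix whose first column is all ones (the x_0 entries), y is an m x 1 target vector, and theta is an (n+1) x 1 parameter vector; the function name compute_cost is an arbitrary choice:

    function J = compute_cost(X, y, theta)
      m = length(y);                         % number of training examples
      errors = X * theta - y;                % h_theta(x^(i)) - y^(i) for every example
      J = (1 / (2 * m)) * sum(errors .^ 2);  % J(theta)
    end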

Gradient descent

  1. Repeat {

    ​ $\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\ldots,\theta_n) $

    }

  2. Previously (n = 1):

    Repeat {

    \(\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})\)

    \(\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}\)

    }

  3. New algorithm (\(n \geq 1\)), simultaneously updating \(\theta_j\) for \(j = 0, 1, \ldots, n\) (an Octave sketch follows this list):

    Repeat {

    \(\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\)

    }
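
A minimal Octave sketch of the update rule above (batch gradient descent, vectorized over all \(n + 1\) parameters). It reuses the compute_cost sketch from the cost-function section; X, y, alpha and num_iters follow the same assumed conventions:

    function [theta, J_history] = gradient_descent(X, y, theta, alpha, num_iters)
      m = length(y);
      J_history = zeros(num_iters, 1);                 % J(theta) after each iteration
      for iter = 1:num_iters
        errors = X * theta - y;                        % m x 1 vector of prediction errors
        theta  = theta - (alpha / m) * (X' * errors);  % simultaneous update of every theta_j
        J_history(iter) = compute_cost(X, y, theta);   % record J(theta) for monitoring
      end
    end

The recorded J_history is what the convergence checks in the next sections plot and inspect.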

Gradient descent in practice I: Feature Scaling

Feature scaling

  1. Idea: make sure features are on a similar scale.

  2. E.g. \(x_1\) = size (0-2000 \({feet}^2\)), \(x_2\) = number of bedrooms (1-5)

    ---> \(x_1 = \frac{\text{size}({feet}^2)}{2000}\), \(x_2 = \frac{\text{number of bedrooms}}{5}\)

  3. Approximately get every feature into a \(-1 \leq x_i \leq 1\) range. Ranges that are much smaller or much larger than this are not acceptable:

    \(-100 \leq x_i \leq 100\) or \(-0.0001 \leq x_i \leq 0.0001\) (×)

Mean normalization

  1. Replace \(x_i\) with \(x_i - \mu_i\) to make features have approximately zero mean (do not apply this to \(x_0 = 1\)).

  2. E.g. \(x_1 = \frac{size - 1000}{2000}\), \(x_2 = \frac{bedrooms - 2}{5}\). --> \(-0.5 \leq x_1 \leq 0.5\), \(-0.5 \leq x_2 \leq 0.5\).

  3. More general rule: \(x_i := \frac{x_i - \mu_i}{s_i}\) (an Octave sketch follows this list).

    \(\mu_i\): average value of feature \(i\) in the training set; \(s_i\): range of the feature values (maximum minus minimum), or alternatively the standard deviation.
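
A minimal Octave sketch of mean normalization combined with scaling by the standard deviation, assuming X is an m x n matrix of raw feature values without the x_0 = 1 column (which must not be scaled); feature_normalize is an arbitrary name:

    function [X_norm, mu, sigma] = feature_normalize(X)
      mu     = mean(X);            % 1 x n vector of feature means (mu_i)
      sigma  = std(X);             % 1 x n vector of standard deviations (s_i)
      X_norm = (X - mu) ./ sigma;  % each feature ends up with roughly zero mean and unit scale
    end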

Gradient descent in practice II: learning rate

Making sure gradient descent is working correctly

  1. \(J(\theta)\) should decrease after every iteration. Plot \(J(\theta)\) against the number of iterations; gradient descent has converged when the curve flattens into a nearly horizontal line.

    [Note] The number of iterations gradient descent needs can vary greatly from problem to problem.

  2. Example automatic convergence test:

    Declare convergence if \(J(\theta)\) decreases by less than some small value \(\varepsilon\) (e.g. \(10^{-3}\)) in one iteration (an Octave sketch follows this list).

    [Note] It is usually difficult to choose a suitable \(\varepsilon\), so inspecting the plot of \(J(\theta)\) is the more common approach.
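
A small Octave sketch of this automatic convergence test. J_history here is a made-up, geometrically decreasing cost history standing in for the values recorded by the gradient_descent sketch above:

    J_history = 100 * 0.8 .^ (0:49)';         % hypothetical J(theta) per iteration
    epsilon   = 1e-3;                         % small threshold; hard to choose well in general
    drops     = -diff(J_history);             % decrease of J(theta) at each iteration
    converged_at = find(drops < epsilon, 1)   % first iteration whose decrease is below epsilon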

Choose learning rate \(\alpha\)

  1. Summary

    If \(\alpha\) is too small: convergence is slow.

    If \(\alpha\) is too large: \(J(\theta)\) may not decrease on every iteration and may not converge.

  2. Choose \(\alpha\)

    try \(\ldots, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, \ldots\) (each candidate roughly 3x the previous one; a sketch follows this list)
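
A short Octave sketch of trying these candidate learning rates, reusing the hypothetical gradient_descent sketch above; X and y are assumed to be a prepared design matrix (with the x_0 = 1 column) and target vector:

    alphas = [0.001 0.003 0.01 0.03 0.1 0.3 1];
    num_iters = 400;
    hold on;
    for k = 1:length(alphas)
      theta_init = zeros(size(X, 2), 1);
      [~, J_history] = gradient_descent(X, y, theta_init, alphas(k), num_iters);
      plot(1:num_iters, J_history);            % one J(theta) curve per learning rate
    end
    xlabel('Number of iterations'); ylabel('J(\theta)'); hold off;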

Features and polynomial regression

Housing prices prediction

\(h_\theta(x) = \theta_0 + \theta_1 \times frontage + \theta_2 \times depth\)

--> Land area: \(x = frontage \times depth\) --> \(h_\theta(x) = \theta_0 + \theta_1x\)

Sometimes looking at the problem from a different angle and defining a new feature, rather than directly using the original features, gives a better model, as in the small example below.
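
A tiny Octave illustration of defining such a new feature; the frontage and depth values are made up:

    frontage = [50; 40; 60];
    depth    = [30; 45; 40];
    x_area   = frontage .* depth    % single new feature x = frontage * depth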

Polynomial regression

For example, when a straight line does not fit the data well, a quadratic or cubic model can be chosen:

\(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 = \theta_0 + \theta_1(size) + \theta_2(size)^2 + \theta_3(size)^3\)
where \(x_1 = (size), x_2 = (size)^2, x_3 = (size)^3\). With such features, feature scaling becomes very important, because the ranges of \((size)\), \((size)^2\) and \((size)^3\) differ enormously (a sketch follows below).
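
A minimal Octave sketch of building these polynomial features from a single column of hypothetical sizes and scaling them with the feature_normalize sketch above:

    size_ft2 = [2104; 1600; 2400; 1416; 3000];          % hypothetical sizes in feet^2
    X_poly = [size_ft2, size_ft2 .^ 2, size_ft2 .^ 3];  % x_1, x_2, x_3
    [X_poly, mu, sigma] = feature_normalize(X_poly);    % their ranges differ enormously
    X = [ones(size(X_poly, 1), 1), X_poly];             % prepend the x_0 = 1 column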

Normal equation

Overview

A method to solve for \(\theta\) analytically, i.e. to obtain a closed-form solution for \(\theta\).

Unlike gradient descent, this method computes the optimal value of \(\theta\) directly in a single step, with no iteration.

Intuition

  1. If \(\theta\) is one-dimensional (\(\theta \in R\)), i.e. a single real number:

    \(\frac{d}{d\theta}J(\theta) = 0\) \(\rightarrow\) \(\theta\)

  2. \(\theta \in R^{n+1}\), \(J(\theta_0,\theta_1,\ldots,\theta_n) = \frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2\)

    \(\frac{\partial}{\partial\theta_j}J(\theta) = 0\) (for every \(j\)) \(\rightarrow\) \(\theta_0, \theta_1,\ldots,\theta_n\)

  3. Written in vector form, \(\theta\) can be computed directly from the following equation (proof omitted): \(X\theta = y\) \(\rightarrow\) \(X^TX\theta = X^Ty\) \(\rightarrow\) \(\theta = (X^TX)^{-1}X^Ty\). In Octave (a worked example follows this list):

    pinv(X'*X)*X'*y
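
A small, self-contained Octave example of the same computation on made-up data; each row of X is one training example with a leading 1 for x_0:

    X = [1 2104 3;
         1 1600 3;
         1 2400 3;
         1 1416 2];                % m = 4 examples, n = 2 features
    y = [400; 330; 369; 232];      % hypothetical target values
    theta = pinv(X' * X) * X' * y  % theta = (X^T X)^{-1} X^T y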

Advantages and disadvantages

m training examples, n features.

  1. Gradient descent
    • Needs to choose \(\alpha\).
    • Needs many iterations.
    • Works well even when \(n\) is large.
  2. Normal equation
    • No need to choose \(\alpha\).
    • No need to iterate.
    • Needs to compute \((X^TX)^{-1}\), which costs roughly \(O(n^3)\).
    • Slow if \(n\) is very large.

Normal equation and non-invertibility

What if \(X^TX\) is non-invertible (singular/degenerate)?

  1. Redundant features (linearly dependent): delete the redundant ones.

    E.g. \(x_1\) = size in \(feet^2\), \(x_2\) = size in \(m^2\)

  2. Too many features (e.g. \(m \leq n\)):

    Delete some features, or use regularization (a small illustration follows this list).
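
A small Octave illustration of the redundant-feature case: x_2 (size in \(m^2\)) is a constant multiple of x_1 (size in \(feet^2\)), so the columns of X are linearly dependent and \(X^TX\) is singular; pinv still returns a usable \(\theta\). All numbers are made up:

    x1 = [2104; 1600; 2400; 1416];        % size in feet^2
    X  = [ones(4, 1), x1, 0.0929 * x1];   % third column is redundant (1 feet^2 ≈ 0.0929 m^2)
    y  = [400; 330; 369; 232];
    theta = pinv(X' * X) * X' * y         % inv(X' * X) would be numerically meaningless here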


Source: www.cnblogs.com/songjy11611/p/12191297.html