Linear Regression with Multiple Variables
Multiple features
Notation
- \(n\) = number of features
- \(x^{(i)}\) = input (features) of the \(i\)-th training example
- \(x_j^{(i)}\) = value of feature \(j\) in the \(i\)-th training example
Hypothesis
- Previously: \(h_\theta(x) = \theta_0 + \theta_1x\)
- Four features: \(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + \theta_4x_4\)
- Multiple features: \(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n\)
For convenience of notation, define \(x_0 = 1\) (i.e. \(x_0^{(i)} = 1\)).
Then \(h_\theta(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n = \theta^Tx\).
We call this multiple linear regression.
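The vectorized hypothesis \(h_\theta(x) = \theta^Tx\) can be sketched in plain Python; the parameter and feature values below are made up purely for illustration:

```python
# Hypothesis h_theta(x) = theta^T x, with x_0 = 1 prepended to the feature vector.
def hypothesis(theta, x):
    """Compute the inner product theta^T x; x already includes x_0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))

theta = [80.0, 0.25, 25.0]   # theta_0, theta_1, theta_2 (hypothetical values)
x = [1.0, 2000.0, 3.0]       # x_0 = 1, size in feet^2, number of bedrooms
print(hypothesis(theta, x))  # 80 + 0.25*2000 + 25*3 = 655.0
```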
Gradient descent for multiple variables
Hypothesis
\(h_\theta(x) = \theta^Tx = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n\)
Parameters
\(\theta_0, \theta_1, \ldots, \theta_n\) --> an \((n + 1)\)-dimensional vector \(\theta\)
Cost function
\(J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m}\sum^m_{i = 1}(h_\theta(x^{(i)}) - y^{(i)})^2\) --> a function of the \((n + 1)\)-dimensional vector \(\theta\)
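The cost function above can be computed directly; a minimal sketch in plain Python on a tiny made-up dataset:

```python
# J(theta) = (1/2m) * sum over i of (h_theta(x^(i)) - y^(i))^2
def cost(theta, X, y):
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sum(t * xj for t, xj in zip(theta, xi))  # h_theta(x^(i))
        total += (h - yi) ** 2
    return total / (2 * m)

# Toy dataset: each row of X starts with x_0 = 1; y = x_1 exactly.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]
j_perfect = cost([0.0, 1.0], X, y)  # perfect fit -> 0.0
j_zero = cost([0.0, 0.0], X, y)     # all-zero theta -> positive cost
print(j_perfect, j_zero)
```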
Gradient descent
Repeat {
$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\ldots,\theta_n) $
}
Previously (n = 1):
Repeat {
\(\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})\)
\(\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}\)
}
New algorithm (\(n \geq 1\)):
Repeat {
\(\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\)
}
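The update rule above (with all \(\theta_j\) updated simultaneously) can be sketched in plain Python; the dataset below is made up for illustration:

```python
# One simultaneous gradient-descent update of all theta_j:
# theta_j := theta_j - alpha * (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
def gradient_descent_step(theta, X, y, alpha):
    m = len(y)
    # Compute all prediction errors first, so every theta_j uses the same theta.
    errors = [sum(t * xj for t, xj in zip(theta, xi)) - yi
              for xi, yi in zip(X, y)]
    return [theta[j] - alpha * sum(e * xi[j] for e, xi in zip(errors, X)) / m
            for j in range(len(theta))]

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]   # x_0 = 1 in each row
y = [1.0, 2.0, 3.0]                        # y = x_1, so optimum is theta = [0, 1]
theta = [0.0, 0.0]
for _ in range(1000):
    theta = gradient_descent_step(theta, X, y, alpha=0.1)
print(theta)  # converges close to [0.0, 1.0]
```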
Gradient descent in practice I: Feature scaling
Feature scaling
Idea: make sure features are on a similar scale.
E.g. \(x_1\) = size (0-2000 \({feet}^2\)), \(x_2\) = number of bedrooms (1-5)
---> \(x_1 = \frac{\text{size}({feet}^2)}{2000}\), \(x_2 = \frac{\text{number of bedrooms}}{5}\)
Get every feature into approximately a \(-1 \leq x_i \leq 1\) range. Ranges that are much too small or much too large are not acceptable:
\(-100 \leq x_i \leq 100\) or \(-0.0001 \leq x_i \leq 0.0001\) (✗)
Mean normalization
Replace \(x_i\) with \(x_i - \mu_i\) to make features have approximately zero mean (do not apply to \(x_0 = 1\)).
E.g. \(x_1 = \frac{size - 1000}{2000}\), \(x_2 = \frac{bedrooms - 2}{5}\). --> \(-0.5 \leq x_1 \leq 0.5\), \(-0.5 \leq x_2 \leq 0.5\).
More general rule: \(x_i = \frac{x_i - \mu_i}{s_i}\),
where \(\mu_i\) is the average value of \(x_i\) over the training set, and \(s_i\) is the range of the feature (maximum minus minimum), or alternatively its standard deviation.
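The general rule above, using the range (max minus min) for \(s_i\), can be sketched as follows; the house sizes are made up for illustration:

```python
# Mean normalization: x := (x - mu) / s, with s = max - min.
def scale_features(column):
    mu = sum(column) / len(column)       # mean of the feature
    s = max(column) - min(column)        # range of the feature
    return [(x - mu) / s for x in column]

sizes = [1000.0, 1500.0, 2000.0, 3000.0]  # raw sizes in feet^2
scaled = scale_features(sizes)
print(scaled)  # [-0.4375, -0.1875, 0.0625, 0.5625] -- roughly centered in [-0.5, 0.5]
```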
Gradient descent in practice II: learning rate
Making sure gradient descent is working correctly
\(J(\theta)\) should decrease after every iteration. Plot \(J(\theta)\) against the number of iterations; when the curve flattens into a nearly horizontal line, gradient descent has converged.
[Note] The number of iterations gradient descent needs can vary greatly from problem to problem.
Example automatic convergence test:
Declare convergence if \(J(\theta)\) decreases by less than some small threshold \(\varepsilon\) (e.g. \(10^{-3}\)) in one iteration.
[Note] Choosing a suitable \(\varepsilon\) is usually difficult, so inspecting the plot is the more common approach.
Choose learning rate \(\alpha\)
Summary
If \(\alpha\) is too small: slow convergence.
If \(\alpha\) is too large: \(J(\theta)\) may not decrease on every iteration; it may not converge at all.
Choose \(\alpha\)
Try …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, … (each value roughly 3× the previous one)
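The effect of the learning rate can be seen directly by tracking \(J(\theta)\) over iterations; a minimal sketch on a tiny made-up dataset, comparing one rate that converges with one that is too large:

```python
# Run gradient descent for a given alpha and record J(theta) each iteration.
def run(alpha, iters=20):
    X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]   # x_0 = 1 in each row
    y = [1.0, 2.0, 3.0]
    m = len(y)
    theta = [0.0, 0.0]
    costs = []
    for _ in range(iters):
        errors = [sum(t * xj for t, xj in zip(theta, xi)) - yi
                  for xi, yi in zip(X, y)]
        costs.append(sum(e * e for e in errors) / (2 * m))   # J at current theta
        theta = [theta[j] - alpha * sum(e * xi[j] for e, xi in zip(errors, X)) / m
                 for j in range(len(theta))]
    return costs

good = run(alpha=0.1)   # J decreases steadily
bad = run(alpha=1.0)    # alpha too large for this data: J grows every iteration
print(good[-1] < good[0], bad[-1] > bad[0])  # True True
```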
Features and polynomial regression
Housing prices prediction
\(h_\theta(x) = \theta_0 + \theta_1 \times \text{frontage} + \theta_2 \times \text{depth}\)
--> Land area: \(x = \text{frontage} \times \text{depth}\) --> \(h_\theta(x) = \theta_0 + \theta_1x\)
Sometimes looking at the problem from a different angle and defining a new feature, rather than using the raw features directly, gives a better model.
Polynomial regression
For example, when a straight line does not fit the data well, try a quadratic or cubic model:
\(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 = \theta_0 + \theta_1(size) + \theta_2(size)^2 + \theta_3(size)^3\)
where \(x_1 = (size)\), \(x_2 = (size)^2\), \(x_3 = (size)^3\). Note that feature scaling becomes very important here, since these features have wildly different ranges.
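Building these polynomial features is a one-liner; a small sketch showing why scaling matters afterward (the size value is made up):

```python
# Derive polynomial features x_1 = size, x_2 = size^2, x_3 = size^3
# from a single raw feature.
def poly_features(size):
    return [size, size ** 2, size ** 3]

features = poly_features(2000.0)  # size in feet^2
print(features)  # [2000.0, 4000000.0, 8000000000.0]
# The ranges span six orders of magnitude -- scale before gradient descent.
```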
Normal equation
Overview
A method to solve for \(\theta\) analytically. Unlike gradient descent, this method computes the optimal \(\theta\) directly in one step.
Intuition
If 1D (\(\theta \in \mathbb{R}\)):
Set \(\frac{d}{d\theta}J(\theta) = 0\) \(\rightarrow\) \(\theta\)
\(\theta \in \mathbb{R}^{n+1}\), \(J(\theta_0,\theta_1,\ldots,\theta_n) = \frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2\)
Set \(\frac{\partial}{\partial\theta_j}J(\theta) = 0\) (for every \(j\)) \(\rightarrow\) \(\theta_0, \theta_1,\ldots,\theta_n\)
Written in vector form, \(\theta\) can be computed directly (proof omitted): from \(X\theta = y\) we get \(X^TX\theta = X^Ty\), hence \(\theta = (X^TX)^{-1}X^Ty\).
In Octave: pinv(X'*X)*X'*y
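The same computation can be checked on a tiny example; a sketch in plain Python for the two-parameter case, with the \(2 \times 2\) inverse written out by hand (dataset values are made up):

```python
# Normal equation theta = (X^T X)^{-1} X^T y for two parameters (theta_0, theta_1),
# using the closed-form inverse of a 2x2 matrix [[a, b], [c, d]].
def normal_equation_2(X, y):
    a = sum(r[0] * r[0] for r in X)              # entries of X^T X
    b = sum(r[0] * r[1] for r in X)
    c = b
    d = sum(r[1] * r[1] for r in X)
    u = sum(r[0] * yi for r, yi in zip(X, y))    # entries of X^T y
    v = sum(r[1] * yi for r, yi in zip(X, y))
    det = a * d - b * c
    # [[d, -b], [-c, a]] / det  applied to  [u, v]
    return [(d * u - b * v) / det, (-c * u + a * v) / det]

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]   # first column is x_0 = 1
y = [2.0, 3.0, 4.0]                        # y = 1 + x exactly
theta = normal_equation_2(X, y)
print(theta)  # [1.0, 1.0]
```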
Advantages and disadvantages
m training examples, n features.
- Gradient descent
  - Need to choose \(\alpha\)
  - Needs many iterations
  - Works well even when \(n\) is large
- Normal equation
  - No need to choose \(\alpha\)
  - No need to iterate
  - Need to compute \((X^TX)^{-1}\), which costs roughly \(O(n^3)\)
  - Slow if \(n\) is very large
Normal equation and non-invertibility
What if \(X^TX\) is non-invertible (singular/degenerate)?
- Redundant features (linearly dependent): delete the redundant feature.
E.g. \(x_1\) = size in \(feet^2\), \(x_2\) = size in \(m^2\)
- Too many features (e.g. \(m \leq n\)): delete some features, or use regularization.
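A quick sketch of why redundant features cause non-invertibility: if one column of \(X\) is a multiple of another, the determinant of \(X^TX\) is exactly zero (a factor of 2 is used here instead of \(3.28^2\) so the arithmetic stays exact):

```python
# Determinant of X^T X for a two-column X; zero means X^T X is singular.
def xtx_det(col1, col2):
    a = sum(x * x for x in col1)
    b = sum(x * y for x, y in zip(col1, col2))
    d = sum(y * y for y in col2)
    return a * d - b * b

size_ft2 = [1000.0, 2000.0, 3000.0]
size_copy = [2.0 * x for x in size_ft2]  # linearly dependent duplicate feature
det = xtx_det(size_ft2, size_copy)
print(det)  # 0.0 -> X^T X cannot be inverted
```

This is why Octave's pinv (pseudoinverse) is used in practice: it still returns a sensible answer even when X'*X is singular.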