一、Multiple Features
This lesson mainly introduces notation. Assume there are \(n\) features. To simplify the matrix operations, define \(x_0 = 1\). The parameter \(\theta\) is then an \((n+1) \times 1\)-dimensional vector, and each training sample \(x\) is also an \((n+1) \times 1\)-dimensional vector, so that for each training sample: \(h_\theta(x) = \theta^T x\).
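As a minimal sketch (not code from the course), the hypothesis for one sample can be computed in NumPy, with \(x_0 = 1\) prepended to the feature vector; the specific numbers are invented for illustration:

```python
import numpy as np

# One training sample with n = 2 features; x_0 = 1 is prepended.
x = np.array([1.0, 2104.0, 3.0])      # [x_0, x_1, x_2]
theta = np.array([50.0, 0.1, 25.0])   # (n+1)-dimensional parameter vector

# Hypothesis h_theta(x) = theta^T x, i.e. an inner product.
h = theta @ x
print(h)  # 50 + 0.1*2104 + 25*3 = 335.4
```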
二、Gradient Descent for Multiple Variables
Similarly, the cost function is defined as:
\[J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2\]
and the parameters are updated until \(J\) converges:
\[\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}\]
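The update rule above can be sketched in vectorized NumPy form; the toy data set below is made up for illustration:

```python
import numpy as np

def gradient_descent_step(X, y, theta, alpha):
    """One simultaneous update of every theta_j for linear regression."""
    m = len(y)
    gradient = X.T @ (X @ theta - y) / m   # vectorized form of the sum above
    return theta - alpha * gradient

# Toy data: first column of X is x_0 = 1, and y = x exactly.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

theta = np.zeros(2)
for _ in range(1000):
    theta = gradient_descent_step(X, y, theta, alpha=0.1)
print(theta)  # approaches [0, 1]
```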
三、Feature Scaling
If the features take values on similar scales, gradient descent converges faster; this is essentially normalization. Andrew recommends scaling feature values into roughly \([-1, 1]\):
\[x_i := \frac{x_i - \mu_i}{s_i}\]
where \(\mu_i\) is the mean of the feature and \(s_i\) can be taken as \(\max - \min\) or as the standard deviation.
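A sketch of this scaling in NumPy, using max − min as \(s_i\) (the sample values are invented for illustration):

```python
import numpy as np

def scale_features(X):
    """Mean-normalize every column: (x - mean) / (max - min)."""
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)   # alternatively: X.std(axis=0)
    return (X - mu) / s, mu, s

# Two features on very different scales (house size, number of rooms).
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0]])
X_scaled, mu, s = scale_features(X)
print(X_scaled)  # every entry now lies within [-1, 1]
```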
四、Learning Rate
1. The number of iterations gradient descent needs before converging is uncertain; plotting \(J\) against the number of iterations helps predict when convergence occurs. Alternatively, convergence can be declared when the decrease in the cost function falls below a certain threshold.
2. For the learning rate, try values such as 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...
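Point 1 above can be sketched as follows: record \(J\) at each iteration and stop once its decrease drops below a threshold (the data and the threshold here are invented for illustration):

```python
import numpy as np

def compute_cost(X, y, theta):
    m = len(y)
    r = X @ theta - y
    return r @ r / (2 * m)

# Toy data: y = 2x exactly, so J can be driven to (almost) zero.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.zeros(2)
alpha = 0.1
costs = [compute_cost(X, y, theta)]

for _ in range(10000):
    theta = theta - alpha * X.T @ (X @ theta - y) / len(y)
    costs.append(compute_cost(X, y, theta))
    if costs[-2] - costs[-1] < 1e-9:   # decrease below threshold: converged
        break

print(len(costs) - 1, costs[-1])
```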
五、Features and Polynomial Regression
Sometimes linear regression does not fit the data, and polynomial regression is needed. Polynomial regression can be converted into linear regression by treating the higher-order terms (e.g. \(x^2\), \(x^3\)) as additional features.
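For example, a quadratic model \(y = \theta_0 + \theta_1 x + \theta_2 x^2\) becomes ordinary linear regression once \(x\) and \(x^2\) are treated as two separate features; the data below is synthetic:

```python
import numpy as np

# Synthetic data generated from a known quadratic curve.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 + 3 * x + 0.5 * x ** 2

# Design matrix with columns [1, x, x^2]: now a linear regression in 3 features.
X = np.column_stack([np.ones_like(x), x, x ** 2])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # recovers approximately [2, 3, 0.5]
```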
六、Normal Equation
The normal equation obtains an analytical solution for \(\theta\) by direct derivation: set the derivative of \(J\) to 0 and solve, so that \(J\) is minimized without iterating as gradient descent does.
\(X\) is the \(m \times (n+1)\) feature matrix and \(y\) is the \(m \times 1\) label vector. From the figure it is easy to derive \(X\theta = y\) (this equation is clearly wrong... \(y\) is only the collected labels), with solution \(\theta = X^{-1} y\) (so the conclusion is wrong too); the \(\theta\) obtained this way obviously cannot minimize the loss function.
The course gives \(\theta = (X^T X)^{-1} X^T y\), obtained by a detailed derivation: differentiate the cost function and set the derivative to 0. This formula can be simplified to \(\theta = X^{-1} y\) only if \(X\) (and hence \(X^T\)) is invertible, because only then does \((X^T X)^{-1} = X^{-1} (X^T)^{-1}\) hold.
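The course formula can be sketched directly in NumPy (toy data invented here; note that \(X\) below is not square, so \(X^{-1}\) does not even exist, while \((X^T X)^{-1}\) does):

```python
import numpy as np

# m = 4 samples, n = 1 feature; first column is x_0 = 1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])   # y = 1 + 2x

# theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)  # approximately [1, 2]
```

In practice `np.linalg.pinv(X) @ y` or `np.linalg.lstsq` is preferred, since it also handles the case where \(X^T X\) is singular or ill-conditioned.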
Comparing the two algorithms: gradient descent requires choosing a learning rate and many iterations but still works well when \(n\) is large, while the normal equation needs no learning rate and no iteration but must compute \((X^T X)^{-1}\), which is expensive when \(n\) is large. The normal equation is only applicable to linear models, and does not require feature scaling.