# Week 3: Linear Regression with Multiple Variables

## 1. Multiple Features

This lesson mainly introduces notation. Assume there are \(n\) features; \(x^{(i)}\) denotes the \(i\)-th training example and \(x_j^{(i)}\) the value of feature \(j\) in that example. The hypothesis becomes:
\[h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n\]
To make the matrix form convenient, define \(x_0 = 1\).
The parameter \(\theta\) is then an \((n+1) \times 1\) vector, and each training example \(x\) is also an \((n+1) \times 1\) vector, so for every training example: \(h_\theta(x) = \theta^T x\).
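The following is a minimal NumPy sketch (not part of the original notes) of the vectorized hypothesis; the helper names `add_intercept` and `hypothesis` are my own, and the features are assumed to be stored row-wise in a matrix:

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones (x_0 = 1) to an m x n feature matrix."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def hypothesis(X, theta):
    """Vectorized h_theta(x) = theta^T x, evaluated for all m examples at once."""
    return X @ theta  # shape (m,)
```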

## 2. Gradient Descent for Multiple Variables

Similarly, the cost function is defined as:
\[J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2\]
and all parameters are updated simultaneously until \(J\) converges:
\[\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}\]
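Below is a rough sketch of batch gradient descent implementing this update rule, assuming `X` already carries the column of ones; `gradient_descent` is a name chosen for illustration:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1500):
    """Batch gradient descent for multivariate linear regression.

    X: (m, n+1) design matrix with a leading column of ones.
    y: (m,) target vector.
    Returns the learned theta and the cost history, which is useful for
    checking convergence (see the Learning Rate section below).
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    J_history = []
    for _ in range(num_iters):
        errors = X @ theta - y                      # h_theta(x^(i)) - y^(i) for all i
        J_history.append((errors @ errors) / (2 * m))
        theta = theta - alpha * (X.T @ errors) / m  # simultaneous update of every theta_j
    return theta, J_history
```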

## 3. Feature Scaling

If the features take values on similar scales, gradient descent converges faster; this is essentially normalization.
Andrew recommends scaling each feature to roughly the range \([-1, 1]\), using mean normalization:
\[x_i := \frac{x_i - \mu_i}{s_i}\]
where \(\mu_i\) is the mean of feature \(i\) and \(s_i\) can be either the range (max \(-\) min) or the standard deviation.
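A minimal sketch of mean normalization, here using the standard deviation as \(s_i\) (the function name is my own):

```python
import numpy as np

def feature_normalize(X):
    """Mean normalization: subtract each feature's mean and divide by its
    standard deviation (the range max - min would work as well)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma
```

Note that the same \(\mu_i\) and \(s_i\) must be applied to any new example before predicting with the learned \(\theta\).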

## 4. Learning Rate

1. The number of iterations gradient descent needs before it converges is not known in advance. Plotting \(J\) against the number of iterations shows when it has converged; alternatively, declare convergence once the decrease in the cost function between iterations falls below some small threshold (a simple check of this kind is sketched after this list).
2. For the learning rate, commonly tried values are 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...
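As a rough sketch (assuming the `gradient_descent` helper above, which returns the cost history), a threshold-based convergence check might look like this; `has_converged` and the `tol` value are illustrative choices, not from the course:

```python
def has_converged(J_history, tol=1e-3):
    """Declare convergence when the cost decreased by less than tol
    between the last two iterations."""
    if len(J_history) < 2:
        return False
    return J_history[-2] - J_history[-1] < tol

# Learning rates commonly tried, roughly 3x apart as suggested in the lecture:
candidate_alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
```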

## 5. Features and Polynomial Regression

Sometimes a straight line does not fit the data well and a polynomial model is needed.
Polynomial regression can be converted into linear regression by treating powers of a feature (e.g. \(x, x^2, x^3\)) as new features; feature scaling then becomes especially important, because these new features have very different ranges.
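A minimal sketch of this conversion (the function name is my own):

```python
import numpy as np

def polynomial_features(x, degree=3):
    """Turn a single feature x (shape (m,)) into the columns
    [x, x^2, ..., x^degree], so that ordinary linear regression can fit
    a polynomial. The powers have very different ranges, which is why
    feature scaling matters here."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])
```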

## 6. Normal Equation

The normal equation obtains an analytical solution for \(\theta\) by direct differentiation: set the derivative of \(J\) to zero so that \(J\) is minimized, without iterating as gradient descent does.
Let \(X\) be the \(m \times (n+1)\) design matrix whose rows are the training examples (with \(x_0 = 1\)), and let \(y\) be the \(m \times 1\) vector of labels.
Note that \(X\theta = y\) does not hold in general (this equation is clearly wrong: \(y\) is just the collected labels, so the system is usually inconsistent), hence the "solution" \(\theta = X^{-1} y\) is also wrong and obviously cannot minimize the loss function.
The course gives \(\theta = (X^T X)^{-1} X^T y\); the detailed derivation comes from differentiating the cost function and setting the gradient to zero. This formula simplifies to \(\theta = X^{-1} y\) only when \(X^T\) and \(X\) are themselves invertible, because only then does \((X^T X)^{-1} = X^{-1} (X^T)^{-1}\).
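A minimal sketch of the closed-form solution; using `np.linalg.pinv` rather than an explicit inverse is my choice here (the course likewise recommends the pseudo-inverse), so the computation still works when \(X^T X\) is singular:

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form theta = (X^T X)^{-1} X^T y.

    X: (m, n+1) design matrix with a leading column of ones.
    y: (m,) label vector.
    The pseudo-inverse handles the case where X^T X is not invertible
    (e.g. redundant features or more features than examples)."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```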

Comparison of the two algorithms, roughly as summarized in the course:

| | Gradient descent | Normal equation |
| --- | --- | --- |
| Learning rate \(\alpha\) | must be chosen | not needed |
| Iterations | many | none |
| Cost | \(O(kn^2)\) | \(O(n^3)\) to compute \((X^T X)^{-1}\) |
| Very large \(n\) | still works well | becomes slow |

In addition, the normal equation only applies to linear models, and it does not require feature scaling.

Origin: www.cnblogs.com/EIMadrigal/p/12130856.html