一、Model Representation
Taking house-price prediction as an example, a picture is worth a thousand words: h represents a function that maps from the input x to the output y.
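As a minimal sketch of this mapping (the parameter values and the house size below are made up for illustration), the hypothesis is just a function from x to a predicted y:

```python
# Hypothesis h(x): maps an input x (house size) to a predicted y (price).
# For univariate linear regression this is a straight line.
def hypothesis(theta0, theta1, x):
    return theta0 + theta1 * x

# Made-up parameters: intercept 50, slope 0.1
print(hypothesis(50.0, 0.1, 1000))  # prints 150.0
```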
二、Cost Function
For univariate linear regression, the hypothesis function is assumed to be:
\[h_{\theta}(x) = \theta_0 + \theta_1 x\]
So the next question is how to determine the parameters \(\theta_0\) and \(\theta_1\)?
These two parameters determine the gap between the model's predictions and the actual values in the training set, which is called the modeling error.
For this regression problem, a reasonable choice of cost function is the following squared-error function:
\[J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2\]
where m is the number of samples in the training set, \(x^{(i)}\) is the size of each house, and \(y^{(i)}\) is its actual price.
We then simply look for the parameters that minimize \(J(\theta_0, \theta_1)\).
The division by 2 mainly serves to cancel the factor of 2 produced by differentiating the square in the later gradient-descent derivation.
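The cost function above can be sketched directly in code (the tiny training set below is made up for illustration):

```python
# Squared-error cost J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2)
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Made-up training set: house sizes and prices
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(cost(0.0, 2.0, xs, ys))  # line y = 2x fits exactly, so cost is 0.0
print(cost(0.0, 1.0, xs, ys))  # residuals 1, 2, 3 -> (1 + 4 + 9) / 6
```

Note how the division by 2m means a perfect fit gives exactly zero cost, and larger errors are penalized quadratically.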
三、Gradient Descent
To find the minimum of the cost function, we use gradient descent:
- Start from some (e.g. random) combination of parameters and compute \(J\)
- Find the parameter change that decreases \(J\) the most, update the parameters, and repeat until a local optimum is reached
It is like walking downhill: at every step you choose the direction of steepest descent, until you reach a local minimum.
In batch gradient descent (each step uses all training samples), all parameters are updated simultaneously:

\[\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)\]
\(\alpha\) is the learning rate, which indicates how large a step to take at each update.
If \(\alpha\) is too small, the updates will be very slow; if \(\alpha\) is too large, the algorithm may overshoot the lowest point and diverge.
When approaching a local optimum, the slope becomes smaller and smaller, so each step automatically becomes smaller; there is no need to reduce the learning rate \(\alpha\).
四、Gradient Descent For Linear Regression
Now apply the gradient descent algorithm to the earlier regression model:
Take the partial derivatives of \(J(\theta_0, \theta_1)\) with respect to \(\theta_0\) and \(\theta_1\), and substitute them into the parameter-update formula:

\[\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})\]

\[\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)}) x^{(i)}\]
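Putting the pieces together, here is a minimal sketch of batch gradient descent for univariate linear regression (the data set, learning rate, and iteration count are made-up choices, not values from the course):

```python
# Batch gradient descent for univariate linear regression, using the
# partial derivatives of the squared-error cost with respect to theta0
# and theta1.
def fit(xs, ys, alpha=0.1, iters=2000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        # errors h(x_i) - y_i over the whole training set (batch)
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # simultaneous update of both parameters
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Made-up data generated from y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta0, theta1 = fit(xs, ys)
print(theta0, theta1)  # converges close to 1 and 2
```

Note that both gradients are computed from the old parameter values before either parameter is overwritten, which is exactly the simultaneous update required by batch gradient descent.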