Machine Learning Introduction (2): Linear Regression

Posted September 22, 2019, 18:22:52; updated September 22, 2019, 18:59:51

Consider the problem of predicting house prices from a given living area and number of bedrooms. We can build a linear regression model, whose hypothesis is assumed to take the form
\[h(x)=\sum_{i=0}^{n} \theta_{i} x_{i}=\theta^{T} x\]
where the \(\theta_{i}\) are called parameters, also known as weights, and \(x_{0}=1\) so that \(\theta_{0}\) acts as the intercept term.
Note that in this notation a superscript denotes the index of a sample and a subscript denotes a component of a sample; \(n\) is the number of input variables.
For the training set, we want the hypothesis's predicted outputs to deviate as little as possible, in the sum-of-squares sense, from the true outputs. So we define the following cost function to measure the hypothesis's performance on the training set,
\[\begin{equation} J(\theta)=\frac{1}{2} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2} \end{equation}\]
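The hypothesis and cost function above can be sketched in NumPy as follows (the toy feature matrix, target values, and function names are illustrative, not from the original post):

```python
import numpy as np

def hypothesis(theta, X):
    # h_theta(x) = theta^T x, applied row-wise; X already contains the x_0 = 1 column
    return X @ theta

def cost(theta, X, y):
    # J(theta) = (1/2) * sum of squared residuals over the m training samples
    residuals = hypothesis(theta, X) - y
    return 0.5 * np.sum(residuals ** 2)

# toy data: intercept column x_0 = 1, living area (scaled), number of bedrooms
X = np.array([[1.0, 2.1, 3.0],
              [1.0, 1.6, 2.0],
              [1.0, 2.4, 4.0]])
y = np.array([400.0, 330.0, 369.0])

print(cost(np.zeros(3), X, y))  # with theta = 0, J = (1/2) * sum(y^2)
```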

LMS algorithm (least mean square)

We use gradient descent to minimize this cost. With the cost function \(J(\theta)\) defined, gradient descent gives the following update rule for each individual parameter,
\[\begin{equation} \theta_{j}:=\theta_{j}-\alpha \frac{\partial}{\partial \theta_{j}} J(\theta) \end{equation}\]

Now we derive the concrete update formula for the linear regression problem. Suppose there is a single training sample \(\left(x, y\right)\),
\[\begin{aligned} \frac{\partial}{\partial \theta_{j}} J(\theta) &=\frac{\partial}{\partial \theta_{j}} \frac{1}{2}\left(h_{\theta}(x)-y\right)^{2} \\ &=2 \cdot \frac{1}{2}\left(h_{\theta}(x)-y\right) \cdot \frac{\partial}{\partial \theta_{j}}\left(h_{\theta}(x)-y\right) \\ &=\left(h_{\theta}(x)-y\right) \cdot \frac{\partial}{\partial \theta_{j}}\left(\sum_{i=0}^{n} \theta_{i} x_{i}-y\right) \\ &=\left(h_{\theta}(x)-y\right) x_{j} \end{aligned}\]
Accordingly, the update rule for a single sample (updating all parameters simultaneously, \(j=0,1,\ldots,n\)) is
\[\theta_{j}:=\theta_{j}+\alpha\left(y^{(i)}-h_{\theta}\left(x^{(i)}\right)\right) x_{j}^{(i)}\]
This update rule is also called the least mean squares (LMS) update rule, or the Widrow-Hoff learning rule. Since it performs one update per sample, the method is called stochastic gradient descent (or incremental gradient descent). Stochastic gradient descent fluctuates more wildly, but in general it still works well, and when the data set is large it is often preferable to the batch method described next.
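A minimal sketch of the stochastic (incremental) LMS update, assuming synthetic noiseless data and an illustrative learning rate:

```python
import numpy as np

def sgd_step(theta, x_i, y_i, alpha):
    # Widrow-Hoff / LMS update for one sample:
    # theta_j := theta_j + alpha * (y^(i) - h_theta(x^(i))) * x_j^(i)
    return theta + alpha * (y_i - x_i @ theta) * x_i

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])  # x_0 = 1 plus one feature
y = 3.0 + 2.0 * X[:, 1]                                  # noiseless line: intercept 3, slope 2

theta = np.zeros(2)
for epoch in range(100):
    for i in rng.permutation(len(y)):   # one update per sample, in shuffled order
        theta = sgd_step(theta, X[i], y[i], alpha=0.05)

print(theta)  # approaches [3, 2]
```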

For the full training set, there is a similar update rule,
\[\begin{equation} \theta_{j}:=\theta_{j}+\alpha \sum_{i=1}^{m}\left(y^{(i)}-h_{\theta}\left(x^{(i)}\right)\right) x_{j}^{(i)} \end{equation}\]
Because every step of this method uses all the training samples, it is called batch gradient descent.
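The batch update can be sketched the same way; here the full gradient \(X^{T}(y-X\theta)\) over all samples is used at every step (the data and step size are illustrative):

```python
import numpy as np

def batch_gd(X, y, alpha=0.01, iters=2000):
    # theta_j := theta_j + alpha * sum_i (y^(i) - h_theta(x^(i))) * x_j^(i),
    # written in vector form for all j at once
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta = theta + alpha * X.T @ (y - X @ theta)
    return theta

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])   # exactly y = 1 + 2x

print(batch_gd(X, y))  # converges to [1, 2]
```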
In general, gradient descent may converge only to a local minimum, but the cost function of the linear regression problem is a convex quadratic function with a single global minimum, so gradient descent (with a suitable learning rate) must converge to that global minimum.
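Because the cost is a convex quadratic, batch gradient descent and the closed-form least-squares solution should land on the same minimizer; a sketch checking this on synthetic data (step size and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([5.0, -3.0]) + 0.1 * rng.normal(size=100)

# closed-form least-squares minimizer of the same convex quadratic J(theta)
theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

# batch gradient descent on J(theta)
theta = np.zeros(2)
alpha = 0.005
for _ in range(5000):
    theta += alpha * X.T @ (y - X @ theta)

print(np.allclose(theta, theta_exact, atol=1e-6))  # both reach the one global minimum
```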


Origin www.cnblogs.com/qizhien/p/11568697.html