[Deep Eye Machine Learning Training Camp, Session 4] Linear Regression

Basic concepts

Let us first look at the basic concepts and notation. $x^{(i)}$ denotes an input variable, also called a feature; $y^{(i)}$ denotes an output variable, also called a label or target. A pair $(x^{(i)}, y^{(i)})$ is a training sample, and $n$ such samples form the training set $\{(x^{(i)}, y^{(i)});\ i = 1, \cdots, n\}$. In addition, we use $\mathcal{X}$ to denote the space of input values and $\mathcal{Y}$ the space of output values. A function that maps an input value to an output value is called a hypothesis, and the collection of such mapping functions is called the hypothesis set, written $h: \mathcal{X} \mapsto \mathcal{Y}$. Taking linear functions as an example, one possible hypothesis is:
$$
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2
$$
where the $\theta_i$ are the parameters (also called weights) of the hypothesis. Clearly, different hypotheses have different parameters.
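As a quick illustration, this two-feature hypothesis can be sketched in Python (the function name and the example values below are my own, not from the original text):

```python
import numpy as np

def h(theta, x):
    """Linear hypothesis h_theta(x) = theta_0 + theta_1 * x_1 + theta_2 * x_2.

    theta: parameter vector (theta_0, theta_1, theta_2); x: features (x_1, x_2).
    """
    return theta[0] + theta[1] * x[0] + theta[2] * x[1]

theta = np.array([1.0, 2.0, 3.0])  # theta_0, theta_1, theta_2
x = np.array([1.0, 1.0])           # x_1, x_2
print(h(theta, x))                 # 1 + 2*1 + 3*1 = 6.0
```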

The machine learning process runs a learning algorithm on the training set to find a "good" hypothesis $h^*$ in the hypothesis set, which is then used to predict output values from input values. If the value to be predicted is continuous, we call the problem regression; if it is discrete, we call it classification. Next, we need to solve two problems. The first is how to measure whether a hypothesis is "good" or "bad". The second is how to find a "good" hypothesis.

Loss function

Let us first look at the first question: how to measure whether a hypothesis is "good" or "bad". An intuitive idea is that, on the training set, the error between the predicted value $h_\theta(x)$ and the true value $y$ should be as small as possible. We therefore define a function that, for different parameters $\theta$, measures the gap between each sample's predicted value $h_\theta(x^{(i)})$ and its true value $y^{(i)}$. This function is the loss function. In regression, the most commonly used loss function is the squared error:
$$
J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
$$
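As a sketch (the function and variable names are my own, not from the text), this loss can be computed with NumPy using a design matrix whose first column is all ones for the intercept:

```python
import numpy as np

def squared_error_loss(theta, X, y):
    """J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2.

    X: (n, d+1) design matrix with a leading column of ones (intercept);
    y: (n,) true values; theta: (d+1,) parameter vector.
    """
    n = X.shape[0]
    residuals = X @ theta - y  # h_theta(x^(i)) - y^(i) for every sample
    return residuals @ residuals / (2 * n)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # intercept + one feature
y = np.array([1.0, 3.0, 5.0])                        # generated by y = 1 + 2x
print(squared_error_loss(np.array([1.0, 2.0]), X, y))  # 0.0 on a perfect fit
```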

Parameter learning

We now know how to measure whether a hypothesis is "good" or "bad"; next we need to solve the second problem: how to find a "good" hypothesis, in other words, how to learn "good" parameters $\theta$. We want to learn parameters that minimize the loss function $J(\theta)$. Two ways of learning the parameters are introduced below.

Gradient descent

Gradient descent is an optimization method. It first initializes the parameters, then computes the gradient of the objective function and updates the parameters along the negative gradient, the direction of steepest descent, until the objective function converges to a minimum. The single-step update equation is as follows:
$$
\theta_j \coloneqq \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
$$
where $\alpha$ is the learning rate, also known as the step size. The key to implementing gradient descent is computing the partial derivative of $J(\theta)$ with respect to $\theta_j$. For a single training sample $(x, y)$, so that $J(\theta) = \frac{1}{2}(h_\theta(x) - y)^2$:
$$
\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta) &= \frac{\partial}{\partial \theta_j} \frac{1}{2} \left(h_\theta(x) - y\right)^2 \\
&= 2 \cdot \frac{1}{2} \left(h_\theta(x) - y\right) \cdot \frac{\partial}{\partial \theta_j} \left(h_\theta(x) - y\right) \\
&= \left(h_\theta(x) - y\right) \cdot \frac{\partial}{\partial \theta_j} \left(\sum_{i=0}^{d} \theta_i x_i - y\right) \\
&= \left(h_\theta(x) - y\right) x_j
\end{aligned}
$$
Therefore the update rule for a single training sample is as follows:
$$
\theta_j \coloneqq \theta_j - \alpha\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)},\quad \forall j\in\{0,1,\cdots,d\}
$$

In vector form, it can also be written as:
$$
\theta \coloneqq \theta - \alpha\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}
$$
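The per-sample update above can be sketched as follows (a minimal NumPy implementation; the function name, learning rate, and epoch count are illustrative choices of mine):

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.1, epochs=200):
    """Stochastic (per-sample) gradient descent for linear regression.

    For each sample, applies theta := theta - alpha * (h_theta(x) - y) * x,
    which updates every component theta_j simultaneously.
    X: (n, d+1) design matrix with intercept column; y: (n,) targets.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            error = X[i] @ theta - y[i]   # h_theta(x^(i)) - y^(i)
            theta -= alpha * error * X[i]
    return theta

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])             # generated by y = 1 + 2x
theta = sgd_linear_regression(X, y)
print(theta)                              # approaches [1.0, 2.0]
```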

When using gradient descent, it helps to apply feature scaling and mean normalization first. These two operations bring all feature values into a similar range, which keeps the contours of the loss function $J(\theta)$ from becoming badly elongated and thus speeds up convergence.
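A common way to do this (a sketch; `standardize` is my own helper name) is to subtract each feature's mean and divide by its standard deviation:

```python
import numpy as np

def standardize(X):
    """Mean-normalize and scale each feature column to unit variance.

    Returns the scaled matrix plus (mu, sigma) so that new inputs can be
    transformed identically at prediction time.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard constant columns
    return (X - mu) / sigma, mu, sigma

X_raw = np.array([[2000.0, 3.0], [1600.0, 3.0], [2400.0, 4.0]])
X_scaled, mu, sigma = standardize(X_raw)
print(X_scaled.mean(axis=0))  # each column now has (near-)zero mean
```

Note that the intercept column of ones should not be scaled; the constant-column guard above leaves it unchanged apart from mean-centering, so in practice scaling is applied to the raw features before the ones column is prepended.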

Normal equation method

Since minimizing $J(\theta)$ is a convex optimization problem, any point where the gradient of $J(\theta)$ vanishes is a global minimum. This means we can compute an analytical solution to the problem directly.

To derive this solution, we stack all the training samples into a matrix $X$, in which each row represents one training sample and each column one feature. $X$ is an $n \times (d+1)$ matrix (including the intercept term):
$$
X = \begin{bmatrix}
-\ (x^{(1)})^T\ - \\
-\ (x^{(2)})^T\ - \\
\vdots \\
-\ (x^{(n)})^T\ -
\end{bmatrix}
$$
Let $\vec{y}$ be the $n$-dimensional vector formed by all the true values:
$$
\vec{y} = \begin{bmatrix}
y^{(1)} \\
y^{(2)} \\
\vdots \\
y^{(n)}
\end{bmatrix}
$$
Since $h_\theta(x^{(i)}) = (x^{(i)})^T \theta$, where $\theta$ is a $(d+1)$-dimensional vector, we have:
$$
\begin{aligned}
X\theta - \vec{y} &= \begin{bmatrix}
(x^{(1)})^T \theta \\
\vdots \\
(x^{(n)})^T \theta
\end{bmatrix} - \begin{bmatrix}
y^{(1)} \\
\vdots \\
y^{(n)}
\end{bmatrix} \\
&= \begin{bmatrix}
h_\theta(x^{(1)}) - y^{(1)} \\
\vdots \\
h_\theta(x^{(n)}) - y^{(n)}
\end{bmatrix}
\end{aligned}
$$
For any vector $z$, we have $z^T z = \sum_i z_i^2$. Accordingly, $J(\theta)$ can be written in matrix form (dropping the constant factor $1/n$, which does not change the minimizer):
$$
\begin{aligned}
J(\theta) &= \frac{1}{2} \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \\
&= \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y})
\end{aligned}
$$
Taking the gradient of $J(\theta)$ gives:
$$
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y}) \\
&= \frac{1}{2} \nabla_\theta \left((X\theta)^T X\theta - (X\theta)^T \vec{y} - \vec{y}^T (X\theta) + \vec{y}^T \vec{y}\right) \\
&= \frac{1}{2} \nabla_\theta \left(\theta^T (X^T X) \theta - \vec{y}^T (X\theta) - \vec{y}^T (X\theta)\right) \\
&= \frac{1}{2} \nabla_\theta \left(\theta^T (X^T X) \theta - 2 (\vec{y}^T X) \theta\right) \\
&= \frac{1}{2} \nabla_\theta \left(\theta^T (X^T X) \theta - 2 (X^T \vec{y})^T \theta\right) \\
&= \frac{1}{2} \left(2 X^T X \theta - 2 X^T \vec{y}\right) \\
&= X^T X \theta - X^T \vec{y}
\end{aligned}
$$
To find the minimum of $J(\theta)$, set the gradient to zero, which yields the normal equation:
$$
X^T X \theta = X^T \vec{y}
$$
Accordingly, the closed-form solution minimizing $J(\theta)$ is:
$$
\theta = (X^T X)^{-1} X^T \vec{y}
$$
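In NumPy this closed-form solution is a one-liner (a sketch; solving the linear system is numerically preferable to forming the explicit inverse):

```python
import numpy as np

def normal_equation(X, y):
    """Solve the normal equation X^T X theta = X^T y for theta.

    np.linalg.solve is used instead of explicitly inverting X^T X,
    which is cheaper and numerically more stable.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # intercept + one feature
y = np.array([1.0, 3.0, 5.0])                        # generated by y = 1 + 2x
print(normal_equation(X, y))                         # recovers [1.0, 2.0]
```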

When $X^T X$ is not invertible, we need to examine the features of the training set carefully and remove redundant, strongly correlated features, or use regularization techniques. Alternatively, we can use the pseudo-inverse of $X^T X$.
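When $X^T X$ is singular, `np.linalg.pinv` still returns a least-squares solution (the minimum-norm one). A small sketch with a deliberately duplicated feature column:

```python
import numpy as np

# Duplicated feature columns make X^T X singular (rank deficient).
X = np.array([[1.0, 1.0], [2.0, 2.0]])
y = np.array([1.0, 2.0])

# The pseudo-inverse picks the minimum-norm solution among all minimizers,
# splitting the weight evenly across the identical columns.
theta = np.linalg.pinv(X.T @ X) @ (X.T @ y)
print(theta)  # [0.5, 0.5]
```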


Origin www.cnblogs.com/littleorange/p/12175529.html