Least-squares linear regression and ridge regression explained with probability theory

Background:

Consider a polynomial curve-fitting problem: the green curve in the figure is the function sin(2πx), and the blue dots are generated by sampling points on that curve and adding noise (by default, normally distributed noise). What we know is the training set of N points x = (x_1, ..., x_N)^T and the corresponding target values t = (t_1, ..., t_N)^T. The goal is to fit a curve to the blue dots; the green curve sin(2πx) is the result we would ideally recover.
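As a concrete illustration of this setup (a minimal NumPy sketch, not the original post's code; N = 10 points and a noise standard deviation of 0.2 are assumptions), such a training set can be generated as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10                                    # number of training points (assumed)
sigma = 0.2                               # assumed noise standard deviation
x = np.linspace(0.0, 1.0, N)              # inputs x = (x_1, ..., x_N)^T
t = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=N)  # noisy targets t = (t_1, ..., t_N)^T
```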


Question:
Suppose the curve we finally fit is a polynomial of order M, given by the following equation:
y(x, w) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j    (Equation 1)

Here w denotes the coefficients of the polynomial, which are the object we ultimately want to find.
We typically use a sum-of-squares error function (an error function, which in the narrow sense is a loss function), given by the following formula:
E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2    (Equation 2)

Here t_n is the true target value at each point, i.e. the blue dots in the figure. Our goal is to find the set of coefficients w that minimizes E(w).
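For illustration, here is a minimal sketch of minimizing E(w) for the polynomial of Equation 1 via ordinary least squares (the polynomial order M = 3 and the data x, t from the earlier sketch are assumptions):

```python
def design_matrix(x, M):
    """Columns x^0, x^1, ..., x^M, so that Phi @ w evaluates Equation 1 at every x_n."""
    return np.vander(x, M + 1, increasing=True)

M = 3                                     # assumed polynomial order
Phi = design_matrix(x, M)
# Minimizing E(w) = 1/2 * sum_n {y(x_n, w) - t_n}^2 is a linear least-squares problem.
w_ls, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print("least-squares coefficients:", w_ls)
```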
This seems like a natural choice, but is it correct? Why is it correct? Why can't we instead use the accumulated absolute residuals as the loss function, as in the formula below?
E(w) = \frac{1}{2} \sum_{n=1}^{N} | y(x_n, w) - t_n |

When we use the least-squares error function, the formula alone does not explain why it is the right choice. The rest of this post explains the reasoning behind least squares from the perspective of probability theory.
Explaining the least-squares method with probability theory:
Start with an assumption: the observed value of a point follows a Gaussian distribution whose mean is the true value y(x, w) and whose variance is β^{-1} (with β^{-1} = σ^2). In other words, we assume by default that the error follows a Gaussian distribution. Written as a mathematical expression:
p(t | x, w, \beta) = \mathcal{N}(t | y(x, w), \beta^{-1})    (Equation 3)
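As a short illustration of this noise model (reusing the assumed sigma, x, and t from the earlier sketch, and using the true curve sin(2πx) as a stand-in for y(x, w)), the density of a single observation under Equation 3 can be evaluated with SciPy:

```python
from scipy.stats import norm

beta = 1.0 / sigma ** 2                   # precision: beta^{-1} = sigma^2
# Density of one observation under Equation 3, with sin(2*pi*x_1) standing in for y(x_1, w).
p_t1 = norm.pdf(t[0], loc=np.sin(2 * np.pi * x[0]), scale=np.sqrt(1.0 / beta))
print("p(t_1 | x_1) =", p_t1)
```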

If the observations at each x are independent and identically distributed, the likelihood function of the observed values t is:
p(t | x, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n | y(x_n, w), \beta^{-1})    (Equation 4)

Taking the logarithm gives the log-likelihood function:
\ln p(t | x, w, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n | y(x_n, w), \beta^{-1})

Namely:
\ln p(t | x, w, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi)    (Equation 5)

Our goal is to maximize Equation 5. Since what we ultimately need is w, and only the first term depends on w, this reduces to minimizing Equation 6:
\sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2    (Equation 6)
This is exactly the least-squares error function we started with!
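To make the equivalence concrete, here is a small sketch (reusing the assumed data x and t, the design matrix Phi, the precision beta, and the least-squares coefficients w_ls from the earlier sketches) that evaluates the log-likelihood of Equation 5 and the sum of squares of Equation 6 for two candidate coefficient vectors; the vector with the smaller sum of squares also has the larger log-likelihood:

```python
def log_likelihood(w, Phi, t, beta):
    """Equation 5: ln p(t | x, w, beta)."""
    resid = Phi @ w - t
    N = len(t)
    return (-beta / 2.0 * np.sum(resid ** 2)
            + N / 2.0 * np.log(beta)
            - N / 2.0 * np.log(2.0 * np.pi))

# beta = 1 / sigma^2 was defined in the sketch after Equation 3.
for w in (w_ls, w_ls + 0.5):              # the least-squares fit and an arbitrary perturbation
    sse = np.sum((Phi @ w - t) ** 2)      # Equation 6
    print(f"sum of squares = {sse:.4f}, log-likelihood = {log_likelihood(w, Phi, t, beta):.4f}")
```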
Summary 1:
Using the least-squares method essentially amounts to maximizing the likelihood function, under the default assumption that the residuals follow a Gaussian distribution.

Explaining ridge regression with probability theory:
On top of the above, we add a prior probability: the fitted parameters w follow a multivariate Gaussian distribution with mean zero. In essence, this restricts w from becoming too large:
p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1} I) = \left( \frac{\alpha}{2\pi} \right)^{(M+1)/2} \exp\left\{ -\frac{\alpha}{2} w^T w \right\}    (Equation 7)
Taking the logarithm of Equation 7 gives:
\ln p(w | \alpha) = \frac{M+1}{2} \ln \frac{\alpha}{2\pi} - \frac{\alpha}{2} w^T w    (Equation 8)
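For illustration, the log-prior of Equation 8 can be evaluated for a given coefficient vector as follows (alpha is an assumed prior precision; Phi and w_ls come from the earlier sketches):

```python
alpha = 5e-3                              # assumed prior precision on w
M_plus_1 = Phi.shape[1]                   # number of coefficients, M + 1
log_prior = (M_plus_1 / 2.0) * np.log(alpha / (2.0 * np.pi)) - (alpha / 2.0) * (w_ls @ w_ls)
print("ln p(w_ls | alpha) =", log_prior)  # Equation 8 evaluated at the least-squares fit
```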
Since (this is one way of writing Bayes' theorem):
posterior probability ∝ likelihood function × prior probability    (Equation 9)
Therefore:
p(w | x, t, \alpha, \beta) \propto p(t | x, w, \beta) \, p(w | \alpha)    (Equation 10)
Now, from the known quantities, we can determine the most probable w from the posterior probability, i.e. by choosing the w that maximizes Equation 10. Taking the negative logarithm of Equation 10 and substituting Equation 5 and Equation 8, maximizing Equation 10 is equivalent to minimizing the following expression:
\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{\alpha}{2} w^T w
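As a sketch of what minimizing this expression looks like in code (reusing Phi, t, beta, and alpha from the earlier sketches), setting its gradient to zero gives the familiar ridge-regression closed form with regularization strength lambda = alpha / beta:

```python
lam = alpha / beta                        # effective ridge penalty: lambda = alpha / beta
I = np.eye(Phi.shape[1])
# Setting the gradient of  beta/2 * ||Phi @ w - t||^2 + alpha/2 * w^T w  to zero gives
# (Phi^T Phi + lam * I) w = Phi^T t.
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * I, Phi.T @ t)
print("ridge (MAP) coefficients:", w_ridge)
```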
Summary 2:
Ridge regression essentially maximizes the posterior probability, with an added prior that the parameters w follow a multivariate Gaussian distribution.

Maximum likelihood estimation (MLE) and maximum a posteriori estimation (MAP):
When explaining the least-squares method with probability theory, we used MLE, i.e. we found the maximum of the likelihood function; when explaining ridge regression with probability theory, we used MAP, i.e. we found the maximum of the posterior probability.
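As a quick hypothetical check using the sketches above (not part of the original post), comparing the two estimates shows the practical effect of the prior: the MAP (ridge) coefficients are pulled toward zero relative to the MLE (plain least-squares) coefficients:

```python
print("||w_MLE|| =", np.linalg.norm(w_ls))     # maximum likelihood estimate (plain least squares)
print("||w_MAP|| =", np.linalg.norm(w_ridge))  # maximum a posteriori estimate (ridge regression)
```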

