LASSO Regression and Ridge Regression

In linear regression, a regularization term is often added to the objective function to prevent over-fitting. The most common choices are L1 and L2 regularization.

1. LASSO regression

Linear regression with an L1 regularization term added is called LASSO regression. The L1 regularization term is the L1 norm of the parameters, i.e., the sum of the absolute values of the components of the parameter vector. For the parameter vector \(\theta = (\theta_0, \theta_1, \cdots, \theta_n)^T\), the L1 regularization term is:

\[ \left \| \theta \right \|_1 = \sum_{j=0}^n | \theta_j | \]

A coefficient \(\lambda\) is usually added to adjust the weight of the regularization term, so the objective function (loss function) of LASSO regression is:

\[ J(\theta) = \frac{1}{2}\sum_{i=1}^m(h(x^{(i)})-y^{(i)})^2 + \lambda \sum_{j=0}^n | \theta_j | = \frac{1}{2}\left(X\theta-Y\right)^T\left(X\theta-Y\right) + \lambda\left \| \theta \right \|_1 \]
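As a quick illustration, the matrix form of this objective can be evaluated directly; the sketch below is a minimal NumPy version (the names `X`, `Y`, `theta`, and `lam` are our own choices, not from any particular library):

```python
import numpy as np

def lasso_objective(theta, X, Y, lam):
    """J(theta) = 1/2 * (X@theta - Y)^T (X@theta - Y) + lam * ||theta||_1"""
    residual = X @ theta - Y
    return 0.5 * residual @ residual + lam * np.sum(np.abs(theta))
```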

LASSO regression can drive the coefficients of some features to zero (i.e., some \(\theta_j\) become exactly zero), yielding a sparse solution.

Since \(|\theta_j|\) is not differentiable at zero, an approximation is often used in practice. Using the function \(f(x;\alpha) = x + \frac{1}{\alpha}\log(1+\exp(-\alpha x))\), which approximates \(x\) for \(x \ge 0\), the absolute value can be approximated as:

\[ \begin{aligned} |x| &\approx f(x;\alpha) + f(-x;\alpha)\\ & = x + \frac{1}{\alpha}\log(1+\exp(-\alpha x)) - x + \frac{1}{\alpha}\log(1+\exp(\alpha x))\\ & = \frac{1}{\alpha}\left(\log(1+\exp(-\alpha x)) + \log(1 + \exp(\alpha x))\right)\\ \end{aligned} \]
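A quick numerical check of this approximation (a sketch of our own; `np.logaddexp` computes \(\log(1+\exp(\cdot))\) stably so that a large \(\alpha\) does not overflow):

```python
import numpy as np

def smooth_abs(x, alpha=1e6):
    """|x| ~= (1/alpha) * (log(1+exp(-alpha*x)) + log(1+exp(alpha*x)))."""
    return (np.logaddexp(0.0, -alpha * x) + np.logaddexp(0.0, alpha * x)) / alpha

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(smooth_abs(x))   # ~[2.0, 0.5, 0.0, 0.5, 2.0]
print(np.abs(x))       # exact values for comparison
```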

Thus the gradient and second derivative of \(|x|\) can be obtained from this approximation:

\[ \begin{aligned} \nabla |x| &\approx \frac{1}{\alpha}\left( \frac{-\alpha \exp(-\alpha x)}{1+\exp(-\alpha x)} + \frac{\alpha \exp(\alpha x)}{1+\exp(\alpha x)} \right)\\ &=\frac{\exp(\alpha x)}{1+\exp(\alpha x)} - \frac{\exp(-\alpha x)}{1+\exp(-\alpha x)}\\ &=\left( 1 - \frac{1}{1+\exp(\alpha x)} \right) - \left( 1 - \frac{1}{1+\exp(-\alpha x)} \right)\\ & = \frac{1}{1+\exp(-\alpha x)} - \frac{1}{1+\exp(\alpha x)}\\ \nabla^2 |x| &\approx \nabla( \frac{1}{1+\exp(-\alpha x)} - \frac{1}{1+\exp(\alpha x)})\\ &=\frac{\alpha\exp(-\alpha x)}{(1+\exp(-\alpha x))^2} + \frac{\alpha \exp(\alpha x)}{(1+\exp(\alpha x))^2}\\ &=\frac{\alpha\frac{1}{\exp(\alpha x)}}{(1+\frac{1}{\exp(\alpha x)})^2} + \frac{\alpha \exp(\alpha x)}{(1+\exp(\alpha x))^2}\\ &=\frac{\alpha\exp(\alpha x)}{(1+\exp(\alpha x))^2} + \frac{\alpha \exp(\alpha x)}{(1+\exp(\alpha x))^2}\\ &=\frac{2\alpha \exp(\alpha x)}{(1+\exp(\alpha x))^2} \end{aligned} \]
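These simplifications translate directly into code; in the sketch below (our own illustration) `scipy.special.expit` is the numerically stable sigmoid \(1/(1+\exp(-t))\), so the gradient is \(\mathrm{expit}(\alpha x) - \mathrm{expit}(-\alpha x)\) and the second derivative is \(2\alpha\,\mathrm{expit}(\alpha x)\,\mathrm{expit}(-\alpha x)\):

```python
import numpy as np
from scipy.special import expit  # stable sigmoid: 1/(1+exp(-t))

def smooth_abs_grad(x, alpha=1e6):
    """d|x|/dx ~= 1/(1+exp(-alpha*x)) - 1/(1+exp(alpha*x)): a smoothed sign function."""
    return expit(alpha * x) - expit(-alpha * x)

def smooth_abs_hess(x, alpha=1e6):
    """d^2|x|/dx^2 ~= 2*alpha*exp(alpha*x)/(1+exp(alpha*x))^2, sharply peaked at x = 0."""
    return 2.0 * alpha * expit(alpha * x) * expit(-alpha * x)

x = np.array([-1.0, -1e-7, 0.0, 1e-7, 1.0])
print(smooth_abs_grad(x))   # ~[-1, -0.05, 0, 0.05, 1]
```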

For general problems, \(\alpha\) is usually taken to be around \(10^6\).

Using this approximate gradient, the first derivative of the objective function is:

\[ \frac{\partial J(\theta)}{\partial \theta} \approx X^TX\theta - X^TY + \frac{\lambda}{1+\exp(-\alpha \theta)} - \frac{\lambda}{1+\exp(\alpha \theta)} \]
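With this smoothed gradient, plain gradient descent already works as a rough solver. The sketch below is our own illustration on synthetic data (the step size, a moderate \(\alpha\), and \(\lambda = 10\) are arbitrary choices, not values from the original text):

```python
import numpy as np
from scipy.special import expit

def lasso_grad(theta, X, Y, lam, alpha=100.0):
    """Approximate gradient: X^T(X@theta - Y) + lam*(expit(alpha*theta) - expit(-alpha*theta))."""
    return X.T @ (X @ theta - Y) + lam * (expit(alpha * theta) - expit(-alpha * theta))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_theta = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
Y = X @ true_theta + 0.1 * rng.normal(size=200)

theta = np.zeros(5)
for _ in range(5000):
    theta -= 1e-3 * lasso_grad(theta, X, Y, lam=10.0)
print(theta)   # coefficients of the two irrelevant features are driven close to zero
```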

Clearly, setting this derivative to zero does not yield a closed-form value of \(\theta\), so LASSO regression is generally solved with methods such as coordinate descent or least angle regression.
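Both solvers are available off the shelf: scikit-learn's `Lasso` uses coordinate descent and `LassoLars` uses least angle regression. A brief, self-contained example (note that scikit-learn calls the regularization weight `alpha` and scales the squared-error term by \(1/(2m)\), so its values are not directly comparable to the \(\lambda\) above):

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoLars

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=200)

cd_model = Lasso(alpha=0.1).fit(X, Y)        # coordinate descent
lars_model = LassoLars(alpha=0.1).fit(X, Y)  # least angle regression
print(cd_model.coef_)                        # the two irrelevant features get (near-)zero weights
print(lars_model.coef_)
```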

2. Ridge regression

Ridge regression is linear regression with an L2 regularization term added. The L2 regularization term is the squared L2 norm of the parameters. For the parameter vector \(\theta = (\theta_0, \theta_1, \cdots, \theta_n)^T\), the L2 regularization term is:

\[ \left \| \theta \right \|_2^2 = \sum_{j=0}^n \theta_j^2 \]

The objective function (loss function) of Ridge regression is:

\[ J(\theta) = \frac{1}{2}\sum_{i=1}^m(h(x^{(i)})-y^{(i)})^2 + \frac{\lambda}{2} \sum_{j=0}^n \theta_j^2 = \frac{1}{2}\left(X\theta-Y\right)^T\left(X\theta-Y\right) + \frac{1}{2}\lambda\left \| \theta \right \|_2^2 \]

The second \(\frac{1}{2}\) is included only to simplify the later calculation. Ridge regression does not discard any features (no \(\theta_j\) is driven exactly to zero); instead it shrinks the regression coefficients, making them small and relatively stable, but it retains more features than LASSO regression.
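This difference is easy to observe empirically; the sketch below (our own illustration on synthetic data) simply counts how many coefficients each model sets exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
Y = X @ np.array([2.0, -1.0, 0.5] + [0.0] * 7) + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, Y)
ridge = Ridge(alpha=0.1).fit(X, Y)
print("LASSO zero coefficients:", np.sum(lasso.coef_ == 0))  # typically several exact zeros
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically none, just small values
```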

The first derivative of the Ridge regression objective function is:

\[ \begin{aligned} \frac{\partial J(\theta)}{\partial \theta} &= \frac{\partial}{\partial \theta}\left(\frac{1}{2}\left(X\theta-Y\right)^T\left(X\theta-Y\right) + \frac{1}{2}\lambda\left \| \theta \right \|_2^2\right)\\ &=X^TX\theta - X^TY + \frac{\partial}{\partial \theta}\left( \frac{1}{2}\lambda\left \| \theta \right \|_2^2 \right)\\ &=X^TX\theta - X^TY + \frac{\partial}{\partial \theta}\left( \frac{1}{2}\lambda \theta^T\theta \right)\\ &=X^TX\theta - X^TY + \lambda \theta \end{aligned} \]

Setting this derivative to zero gives the value of \(\theta\):

\[ \theta = (X^TX+\lambda I)^{-1}X^TY \]
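This closed-form solution is straightforward to implement; a minimal NumPy sketch (solving the linear system instead of forming the inverse explicitly, with synthetic data of our own):

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """theta = (X^T X + lam*I)^(-1) X^T Y, computed via a linear solve."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ Y)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(ridge_closed_form(X, Y, lam=1.0))   # close to [1.0, -2.0, 0.5], slightly shrunk
```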

As can be seen, this coincides with the analytical solution of linear regression with the perturbation term \(\lambda I\) added.


Source: www.cnblogs.com/Ooman/p/11350095.html