The curse of dimensionality and lasso regression

The curse of dimensionality
High-dimensional data: data whose number of dimensions is very high, sometimes much larger than the sample size.
In high dimensions the data become very sparse: the sample size always looks small compared with the size of the space. For example, using one-hot encoding to build a bag-of-words model very easily produces a sparse matrix, as in the sketch below.
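A minimal sketch of that sparsity (toy data, an assumed example): scikit-learn's OneHotEncoder returns a SciPy sparse matrix by default, precisely because almost every entry is 0.

```python
# A minimal sketch (assumed toy data): one-hot encoding a categorical column.
# The output is stored as a SciPy sparse matrix because almost every entry is 0.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], ["green"], ["blue"], ["green"]])
enc = OneHotEncoder()              # sparse output is the scikit-learn default
X_encoded = enc.fit_transform(X)

print(type(X_encoded))             # a scipy.sparse matrix
print(X_encoded.toarray())         # dense view: one 1 per row, the rest 0
```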
The curse of dimensionality: the biggest problem in moving from low dimensions to high dimensions is this explosion of dimensionality, which is what we call the curse of dimensionality. As the number of dimensions increases, the number of samples needed to cover the space grows exponentially.

What happens as the dimensionality grows from low to high:

  1. More samples are needed, and the required number grows exponentially with the data dimension
  2. The data become sparser, which is the curse of dimensionality at work
  3. Prediction in a high-dimensional data space is no longer easy
  4. The model overfits easily

Ways to deal with the overfitting caused by the curse of dimensionality:

  1. Increase sample size
  2. Reduce the number of features (data dimensionality reduction)

Common methods of data dimensionality reduction:

  1. Principal component analysis (PCA)
  2. Ridge regression (L2 regularization: the penalty term is λ times the squared norm of β; see the formulas below)
  3. Lasso regression (L1 regularization: the penalty term is λ times the sum of the absolute values of β; see the formulas below)
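For concreteness, these are the two penalized least-squares objectives in their standard form (λ is the penalty weight, β the coefficient vector, as above):

$$
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta}\ \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2,
\qquad
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta}\ \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1
$$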

Principal component analysis: a data dimensionality reduction method. The idea is to combine the individual variables into a new set of composite variables that are rich in meaning and uncorrelated with one another, while still carrying most of the information in the originals (usually about 80% of the information in the original variables is retained, and the remaining roughly 20%, which explains Y poorly, is discarded). Starting from all of the original variables, PCA obtains principal components as linear combinations of them; selecting just a few principal components retains most of the information in the original variables, so those few components can replace the originals, achieving dimensionality reduction.
Note: principal component analysis is only applicable when the dimension of the space is smaller than the sample size (d < n). When the dimension of the data space is very high it no longer applies. In that situation, one can first use a random forest or decision tree to select a subset of the dimensions, and then run PCA on that newly selected subset to reduce the dimensionality further.
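A minimal sketch of the 80% rule of thumb above (assumed random toy data with d < n): in scikit-learn, passing a fraction to n_components keeps just enough components to explain that share of the variance.

```python
# A minimal sketch (assumed toy data, d < n): keep enough principal
# components to explain ~80% of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))       # n = 200 samples, d = 10 features

pca = PCA(n_components=0.8)          # a fraction means "explain 80% of variance"
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (200, k) with k <= 10
print(pca.explained_variance_ratio_.sum())   # roughly 0.8 or slightly above
```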

Ridge regression and lasso regression both address overfitting; specifically, they deal with the problems of having a large number of features and of correlation between features.
Ridge regression introduces a perturbation by adding a quadratic penalty term in λ, striking a balance between variance and bias.

Why is this called a perturbation?
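One standard way to make it concrete (the usual ridge closed form, not something recovered from the omitted figures): the ridge estimate perturbs $X^{\top}X$ by $\lambda I$, which makes the matrix invertible even when features are strongly correlated or d > n:

$$
\hat{\beta}_{\text{ridge}} = (X^{\top} X + \lambda I)^{-1} X^{\top} y
$$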
Choosing the ridge coefficient: λ is taken at the "bell mouth" of the ridge trace plot (figure omitted), where the coefficient paths begin to stabilize and some coefficients are shrunk close to 0; each curve in the plot was drawn using ten-fold cross-validation, as sketched below.
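A minimal sketch of choosing λ by cross-validation, in the spirit of the ten-fold procedure just described (toy data; the grid of candidate λ values is an assumption):

```python
# A minimal sketch (assumed toy data): pick the ridge penalty lambda
# by ten-fold cross-validation over a grid of candidates.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_beta = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=100)

alphas = np.logspace(-3, 3, 50)            # candidate lambda values
model = RidgeCV(alphas=alphas, cv=10).fit(X, y)

print(model.alpha_)                        # the lambda chosen by CV
print(model.coef_)                         # shrunken, but not exactly 0
```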

Lasso is a data dimensionality reduction method, applicable not only to linear cases but also to nonlinear ones. Lasso performs variable selection on the sample data through a penalty: by compressing the original coefficients, the ones that were already small are compressed all the way to 0, so the variables corresponding to those coefficients are treated as insignificant and discarded outright.
Lasso regression model: L1 regularization is added to the squared error (see the lasso objective above).
The difference between lasso regression and ridge regression is that the lasso loss function is not differentiable at β = 0 (because of the absolute-value term), so traditional gradient-based methods cannot be used directly to minimize it.
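A minimal sketch of one common workaround, coordinate descent with soft-thresholding (a standard lasso solver, e.g. the one behind scikit-learn's Lasso; this toy implementation is illustrative, not the post's own derivation):

```python
# A minimal, illustrative coordinate-descent solver for the lasso objective
# (1/(2n)) * ||y - X @ beta||^2 + lam * ||beta||_1.
# The absolute value is handled by its proximal operator (soft-thresholding)
# instead of a gradient, which does not exist at beta_j = 0.
import numpy as np

def soft_threshold(z, t):
    # Shrink z toward 0 by t; exactly the step that sets small coefficients to 0.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            r_j = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r_j / n
            beta[j] = soft_threshold(z, lam) / (X[:, j] @ X[:, j] / n)
    return beta
```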

**The main difference between ridge regression and lasso regression**

  1. Although ridge regression also compresses the variable coefficients, it does not compress them all the way to 0, so all variables are retained; lasso regression can compress some variable coefficients exactly to 0, thereby achieving dimensionality reduction, as the sketch below shows.
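A minimal side-by-side sketch of that difference (assumed toy data in which two of the five true coefficients are 0):

```python
# A minimal sketch contrasting the two penalties on the same toy data:
# ridge shrinks every coefficient but leaves them nonzero, while lasso
# typically sets the coefficients of irrelevant features exactly to 0.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

print("ridge:", Ridge(alpha=1.0).fit(X, y).coef_)   # all entries nonzero
print("lasso:", Lasso(alpha=0.1).fit(X, y).coef_)   # some entries exactly 0.0
```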

The left panel of the (omitted) figure shows the lasso method and the right panel the ridge regression method.
Taking two-dimensional space as an example, each panel shows the contours and the constraint region of one method: the red ellipses are contours of the residual sum of squares, centered at β̂, the least-squares estimate of the ordinary linear model; the blue area is the constraint region, which is what differs between the two panels.
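In formulas, the blue constraint regions correspond to the equivalent constrained problems (standard formulations, stated here for the two-dimensional case):

$$
\text{lasso: } \min_{\beta} \lVert y - X\beta \rVert_2^2 \ \text{ s.t. } \ |\beta_1| + |\beta_2| \le t,
\qquad
\text{ridge: } \min_{\beta} \lVert y - X\beta \rVert_2^2 \ \text{ s.t. } \ \beta_1^2 + \beta_2^2 \le t
$$

Because the L1 region is a diamond with corners on the axes, the red contours usually first touch it at a corner, where one coordinate is exactly 0; the L2 region is a disc with no corners, so the touching point generally has no zero coordinate. This is the geometric reason lasso performs variable selection while ridge only shrinks.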

