【Machine Learning】Data Standardization in Regression

This has been confusing me recently: data often needs to be standardized before performing regression, yet some sources say that standardization is not required for linear regression. After reading through a lot of material, I am summarizing what I found in this post.

Why standardize data

The rationale for data standardization usually comes from the units of the continuous independent variables. For example, if we regress on population, the regression coefficient obtained when the variable is measured in single persons is very different from the one obtained when it is measured in millions: measured in single persons, the coefficient becomes extremely small. In such cases the raw data should be standardized so that every variable has the same range or variance.

About data standardization and centering

Standardization: scaling the data so that it falls into a small, specific interval. It is often used when processing comparison and evaluation indicators, to remove the units from the data and turn values into dimensionless pure numbers. Commonly used methods are Min-Max scaling and the Z-score.

Centering: subtracting its mean from the variable, shifting the data so that it is centered at zero.

Min-Max scaling: x' = (x - min(x)) / (max(x) - min(x)), which maps values into [0, 1]
Z-score: x' = (x - μ) / σ, which gives zero mean and unit variance
Centering: x' = x - μ

When standardization is needed

  1. In clustering, standardization is particularly important, because clustering relies on distance measures between and within clusters. If one variable is on a much larger scale than the others, any distance metric we use will be unduly dominated by that variable.

  2. Before PCA dimensionality reduction. Standardizing the variables is crucial before principal component analysis (PCA), because PCA assigns more weight to variables with higher variance than to variables with very small variance. Standardizing the raw data gives every variable the same variance, so variables that merely happen to be measured on a larger scale no longer receive higher weights (see the R sketch after this list).

  3. KNN, for a reason similar to k-means clustering: KNN measures similarity with Euclidean distance, so standardization lets every variable play the same role.

  4. In SVMs, data standardization is required when using any kernel that involves distance calculations, such as the RBF kernel.

  5. Standardization is a must for Ridge regression and the Lasso. The reason is that regularization is a biased estimator that penalizes the weights; if the variables are on different scales, the same penalty shrinks them unevenly and introduces a larger bias. (This is why, for example, R's glmnet standardizes predictors internally by default.)
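As a quick illustration of point 2, here is a small R sketch on simulated data (the variable names are made up for the example) showing how the PCA loadings change once the variables are scaled:

```r
set.seed(1)
# Two variables on very different scales
x <- data.frame(a = rnorm(100, sd = 1),
                b = rnorm(100, sd = 100))

# Without scaling, the first principal component is dominated almost
# entirely by the high-variance variable 'b'
prcomp(x)$rotation

# With scaling (scale. = TRUE divides each column by its standard
# deviation), both variables contribute comparably
prcomp(x, scale. = TRUE)$rotation
```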

When standardization is not needed

  1. When using ordinary linear regression, there is no need for standardization, because standardizing the data does not change the model's predictions; only the fitted weights change (see the proof and the R experiment below).
  2. Likewise, standardization does not affect the predictions of logistic regression, decision trees, or tree-based ensembles such as random forests and gradient boosting; trees split on thresholds and are invariant to monotone rescaling of the features (see the R sketch after this list).
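To illustrate the claim about trees, here is a small R sketch using the recommended rpart package (simulated data, hypothetical names): rescaling the feature changes the split thresholds but not the tree's predictions.

```r
library(rpart)

set.seed(3)
d   <- data.frame(x = runif(200))
d$y <- ifelse(d$x > 0.5, 2, 0) + rnorm(200, sd = 0.1)

tree_raw    <- rpart(y ~ x, data = d)
d_scaled    <- transform(d, x = 1000 * x + 7)   # monotone rescaling of x
tree_scaled <- rpart(y ~ x, data = d_scaled)

# Same tree structure with rescaled thresholds, identical predictions
all.equal(predict(tree_raw), predict(tree_scaled))   # TRUE
```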

Proof that linear regression does not require standardization:

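The figure that originally appeared here is lost; what follows is the standard argument, sketched for ordinary least squares with an intercept. Suppose the model fitted on the raw data is

$$\hat{y} = \beta_0 + \sum_j \beta_j x_j .$$

Standardize each feature as $\tilde{x}_j = (x_j - \mu_j)/\sigma_j$, so that $x_j = \sigma_j \tilde{x}_j + \mu_j$. Substituting,

$$\hat{y} = \Big(\beta_0 + \sum_j \beta_j \mu_j\Big) + \sum_j (\beta_j \sigma_j)\,\tilde{x}_j = \tilde{\beta}_0 + \sum_j \tilde{\beta}_j \tilde{x}_j .$$

Standardization is an invertible affine map of the features, so every model in raw coordinates corresponds to exactly one model in standardized coordinates with the same fitted values. The least-squares solutions therefore match, with $\tilde{\beta}_j = \beta_j \sigma_j$ and $\tilde{\beta}_0 = \beta_0 + \sum_j \beta_j \mu_j$, and the predictions are identical.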

In linear regression, standardization does not affect the final predictions, even though the weights learned during training will differ.

The following experiment, carried out in R, demonstrates this:

1. When standardization is not adopted, fit the linear model on the raw training data and predict directly on the raw test set.
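Since the original screenshots of the R session are lost, here is a minimal sketch of the same experiment on simulated data; all names (dat, train, test, fit_raw and so on) are hypothetical.

```r
set.seed(42)

# Simulated data with predictors on very different scales
n  <- 120
x1 <- rnorm(n, mean = 50, sd = 10)
x2 <- rnorm(n, mean = 5,  sd = 2)
y  <- 3 * x1 + 7 * x2 + rnorm(n)

dat   <- data.frame(x1, x2, y)
train <- dat[1:100, ]
test  <- dat[101:120, ]

# Fit on the raw training data, predict on the raw test data
fit_raw  <- lm(y ~ x1 + x2, data = train)
pred_raw <- predict(fit_raw, newdata = test)
```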

2. When standardization is adopted, standardize the data before training. When making predictions, the test inputs must be standardized as well, using the mean and variance computed on the training set.
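Continuing the sketch above: standardize the predictors and the response using statistics computed on the training set only, refit, and apply the same transformation to the test inputs before predicting.

```r
# Training-set means and standard deviations, reused for the test set below
mu <- sapply(train, mean)
s  <- sapply(train, sd)

# Standardize the training data and refit
train_std <- as.data.frame(scale(train, center = mu, scale = s))
fit_std   <- lm(y ~ x1 + x2, data = train_std)

# Standardize the test inputs with the TRAINING mean/sd, then predict;
# the predictions come out on the standardized scale of y
test_std <- as.data.frame(scale(test[, c("x1", "x2")],
                                center = mu[c("x1", "x2")],
                                scale  = s[c("x1", "x2")]))
pred_std <- predict(fit_std, newdata = test_std)
```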

3. Compare the predictions of the standardized and unstandardized models. After rescaling, the two agree, showing that standardization does not affect the regression results.
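Still within the same sketch, the standardized predictions are mapped back to the original units of y before the comparison:

```r
# Undo the standardization of y: multiply by its sd, add back its mean
pred_rescaled <- pred_std * s["y"] + mu["y"]

# The two prediction vectors agree up to floating-point error
all.equal(unname(pred_raw), unname(pred_rescaled))   # TRUE
max(abs(pred_raw - pred_rescaled))                   # effectively zero
```

The trained coefficients of fit_std and fit_raw differ, but the predictions coincide, which is exactly what the derivation above says.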

Some other common pitfalls with linear regression

  1. Using linear models without considering linear correlations
    Imagine building a linear model with two variables X1 and X2, where the true model is Y = X1 + X2. Ideally, with only a small amount of observation noise, linear regression will recover the true model. However, if X1 and X2 are (nearly) linearly dependent, say X1 ≈ X2, then as far as the optimization objective is concerned, Y = 2*X1, Y = 3*X1 - X2 and Y = 100*X1 - 99*X2 are all equally good. The predictions may be perfectly fine, since the estimate is still unbiased, but the problem becomes ill-conditioned and the coefficient weights become impossible to interpret (see the R sketch after this list).

  2. Interpreting the absolute value of the coefficients of a linear or logistic regression model as feature importance. Because many existing linear regression packages return a p-value for each coefficient, many practitioners believe that for linear models, the larger the absolute value of a coefficient, the more important the corresponding feature. This is rarely the case, because: (a) changing a variable's scale changes the absolute value of its coefficient; and (b) if features are linearly correlated, weight can transfer from one coefficient to another. Moreover, the more features a dataset has, the more likely they are to be linearly correlated, and the less reliable it is to use coefficients to explain feature importance.
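Both points can be seen in a small R sketch (simulated data, hypothetical numbers) where two almost perfectly collinear predictors make the individual coefficients meaningless even though the fit itself is fine:

```r
set.seed(7)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.001)      # nearly identical to x1
y  <- x1 + x2 + rnorm(n, sd = 0.1)   # true model: Y = X1 + X2

fit <- lm(y ~ x1 + x2)
coef(fit)             # individual weights are poorly determined (huge standard errors)
sum(coef(fit)[2:3])   # but their sum stays close to the true total of 2
```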

