Regression Analysis - Normal Distribution Summary

1. Data usually depart from a normal distribution because of their skewness and kurtosis.
2. Skewness is usually handled with a power (exponent) transformation, a logarithmic transformation, or a Box-Cox transformation.
3. Differences in location and spread (where the peak sits and how tall and narrow it looks) are usually handled by standardizing the data; the kurtosis statistic itself is not changed by standardization.
4. In addition, standardization cannot turn skewed data into a normal distribution.
5. The main value of standardization in regression problems is that it makes variables more comparable, and it can make it easier for the analysis to satisfy the residual assumptions of regression analysis.
6. Standardization can only turn a normal distribution into a standard normal distribution; it does not turn skewed data into a normal distribution. It therefore applies when normally distributed data have a mean other than 0 or a variance other than 1 (a normal distribution has kurtosis 3 whatever its variance).
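A minimal sketch of the point about standardization, using only the Python standard library. The sample values and the helper names (`skewness`, `kurtosis`, `standardize`) are made up for illustration. It suggests that z-scoring moves the mean to 0 and the standard deviation to 1 while leaving the shape statistics as they were.

```python
import statistics

def skewness(xs):
    """Population skewness: the third standardized moment."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def kurtosis(xs):
    """Population kurtosis (non-excess): 3 for a normal distribution."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs)

def standardize(xs):
    """Z-score: subtract the mean, divide by the standard deviation."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return [(x - m) / s for x in xs]

# Hypothetical right-skewed sample, chosen only for illustration.
data = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 10.0, 25.0]
z = standardize(data)

print("mean, sd after:", statistics.fmean(z), statistics.pstdev(z))
print("skewness before/after:", skewness(data), skewness(z))
print("kurtosis before/after:", kurtosis(data), kurtosis(z))
```

The skewed sample stays just as skewed after standardization, which is why a shape-changing transformation is needed for skewed data.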
Practical notes:
1. Source: Data preprocessing: normalization and standard transformation of data
2. Introduction to the nonlinear transformations QuantileTransformer and Box-Cox
Box-Cox: maps the data toward a Gaussian distribution
QuantileTransformer: maps the data to a uniform distribution (its default output)
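To make the two mappings concrete, here is a hand-rolled sketch in plain Python. It is not the scikit-learn implementation. In scikit-learn, Box-Cox is available through `PowerTransformer(method="box-cox")`, which estimates lambda from the data, and `QuantileTransformer` also accepts `output_distribution="normal"`. In the sketch, lambda is passed in by hand, and the function names `box_cox` and `to_uniform` are assumptions made for this example.

```python
import math
from bisect import bisect_left, bisect_right

def box_cox(x, lam):
    """Box-Cox transform of one strictly positive value for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return math.log(x)           # limiting case as lambda -> 0
    return (x ** lam - 1.0) / lam

def to_uniform(xs):
    """Map each value to its empirical quantile in (0, 1), ties averaged."""
    n = len(xs)
    ordered = sorted(xs)
    out = []
    for x in xs:
        below = bisect_left(ordered, x)    # count of values strictly smaller
        upto = bisect_right(ordered, x)    # count of values smaller or equal
        out.append((below + upto) / (2 * n))
    return out

sample = [1, 10, 100, 1000]
print([box_cox(x, 0) for x in sample])   # evenly spaced on the log scale
print(to_uniform(sample))                # evenly spaced in (0, 1)
```

The quantile mapping depends only on ranks, so it flattens any continuous distribution, while Box-Cox keeps the relative distances of a power or log scale.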
3. Introduction to normal transformation steps and methods
Characteristics of a symmetric distribution: left-right symmetry, mean = median = mode, skewness = 0
Characteristics of a positively skewed distribution: long tail on the right, typically mean > median > mode, skewness > 0
Characteristics of a negatively skewed distribution: long tail on the left, typically mean < median < mode, skewness < 0
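The three cases can be checked on small samples. The following sketch uses invented data and a hypothetical `describe` helper. The ordering of mean, median, and mode is a rule of thumb that happens to hold for these samples, not a guarantee for every data set.

```python
import statistics

def skewness(xs):
    """Population skewness (third standardized moment)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def describe(xs):
    """Collect the statistics used to characterize the shape."""
    return {
        "mean": statistics.fmean(xs),
        "median": statistics.median(xs),
        "mode": statistics.mode(xs),
        "skewness": skewness(xs),
    }

# Made-up samples: symmetric, long right tail, and its mirror image.
symmetric = [1, 2, 3, 3, 3, 4, 5]
right_tailed = [1, 2, 2, 2, 3, 4, 5, 6, 9, 20]
left_tailed = [-x for x in right_tailed]

for name, sample in [("symmetric", symmetric),
                     ("right-tailed", right_tailed),
                     ("left-tailed", left_tailed)]:
    print(name, describe(sample))
```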
1. If the data are moderately skewed
If the skewness is 2 to 3 times its standard error, consider a square root transformation.

2. If the data are highly skewed
If the skewness is more than 3 times its standard error, take the logarithm, either the natural logarithm or the base-10 logarithm.
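The two rules above can be written as a small decision helper. This is only a sketch: the function name `suggest_transform` is hypothetical, the "none" answer for ratios below 2 is an assumption (the notes do not say what to do there), and the rule is applied to right-skewed data only, since square root and log transformations compress a right tail.

```python
def suggest_transform(skew, se_skew):
    """Suggest a transformation from the skewness-to-standard-error ratio.

    Assumes right-skewed data; left skew is not covered by the rules above.
    """
    ratio = skew / se_skew
    if ratio > 3:
        return "log"     # highly skewed: more than 3 standard errors
    if ratio >= 2:
        return "sqrt"    # moderately skewed: 2 to 3 standard errors
    return "none"        # assumption: below 2, leave the data as they are

print(suggest_transform(0.5, 0.2))  # ratio 2.5
print(suggest_transform(1.0, 0.2))  # ratio 5.0
print(suggest_transform(0.1, 0.2))  # ratio 0.5
```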

1. The standard errors of skewness and kurtosis depend directly on the sample size. Specifically, the standard error of skewness is approximately sqrt(6/n) and the standard error of kurtosis is approximately sqrt(24/n), where n is the sample size. The larger the sample, the smaller the standard error.
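A short sketch of these approximations, with hypothetical function names, shows how quickly the standard errors shrink as n grows. Note that sqrt(24/n) is exactly twice sqrt(6/n).

```python
import math

def se_skewness(n):
    """Approximate standard error of skewness: sqrt(6 / n)."""
    return math.sqrt(6.0 / n)

def se_kurtosis(n):
    """Approximate standard error of kurtosis: sqrt(24 / n)."""
    return math.sqrt(24.0 / n)

for n in (30, 100, 600, 2400):
    print(f"n={n:5d}  SE(skewness)={se_skewness(n):.3f}  "
          f"SE(kurtosis)={se_kurtosis(n):.3f}")
```

With n = 600 the standard error of skewness is about 0.1, so a sample skewness of 0.3 already sits at the 3-standard-error threshold. In large samples even mild skewness crosses the thresholds.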

2. The normal transformation method of data is not universal. It is necessary to select or create a suitable transformation formula according to different data distribution conditions. After transformation, the transformation effect must be verified to finally achieve the goal of transformation.

3. Not all non-normally distributed data can be transformed into normally distributed data through normal transformation. Data that are not normally distributed can also be analyzed using nonparametric methods.

4. Transformations for right-skewed data
Many right-skewed data sets become approximately normal, with stable variance, after a logarithmic transformation.
For mildly right-skewed data, use a square root transformation.
For severely right-skewed data, use a reciprocal transformation.
The transformations above are summarized for data greater than 1, where they work quite well. For numbers between 0 and 1 (especially probabilities), it is easy to find
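Since the notes advise verifying the effect of a transformation, here is a minimal sketch that compares the skewness of a made-up right-skewed sample (all values greater than 1) before and after each candidate transformation. The data and names are illustrative assumptions, not taken from the source.

```python
import math
import statistics

def skewness(xs):
    """Population skewness (third standardized moment)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

# Hypothetical right-skewed sample with every value greater than 1.
raw = [1.5, 2, 2, 3, 3, 4, 5, 8, 15, 60]

candidates = {
    "raw": raw,
    "sqrt": [math.sqrt(x) for x in raw],
    "log": [math.log(x) for x in raw],
    # The reciprocal reverses the order of the values,
    # so the sign of the skewness can flip.
    "reciprocal": [1.0 / x for x in raw],
}

skews = {name: skewness(values) for name, values in candidates.items()}
for name, g in skews.items():
    print(f"{name:>10}: skewness = {g:+.2f}")
```

On this sample the square root reduces the skewness and the logarithm reduces it further. Which transformation ends up closest to zero depends on the data, which is the reason for checking each time.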

5. Multicollinearity
 Definition:

Case:
2022 SAS China University Data Analysis Contest rematch topic - agricultural product futures - normalizing non-normal (bimodal) data (sklearn)
Using LightGBM to simulate and predict Boston housing prices - logarithmic transformation of the target variable - Box-Cox transformation of numerical variables

Origin blog.csdn.net/txmmy/article/details/128114754