Centralization and normalization of data preprocessing

Table of Contents

1. Background

2. Purpose

3. Principle

4. Meaning

5. Standardization (normalization) advantages and methods

6. Usage scenarios of the normalization methods


In order to solve classification and regression problems in machine learning, it is usually necessary to centralize and standardize the original data.

1. Background

In data mining and data processing, different evaluation indicators often have different dimensions and units, which affects the results of data analysis. To eliminate the dimensional influence between indicators and make them comparable, the data must be standardized. Once the raw data have been standardized, all indicators are on the same order of magnitude and are suitable for comprehensive comparative evaluation.

2. Purpose

Centralization and standardization transform the data so that it has a mean of 0 and a standard deviation of 1 (and follows the standard normal distribution if the original data were normally distributed). This removes errors caused by differing dimensions, differing inherent variation, or large differences in magnitude between features.

3. Principle

Centralization (also called zero averaging): subtract the mean of a variable from each of its values. It is essentially a translation: after the shift, the data are centered at the origin.

Standardization (also called normalization): subtract the mean from each value and then divide by the standard deviation.
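
A rough sketch of both operations (assuming NumPy and a small made-up feature matrix):

```python
import numpy as np

# Small made-up feature matrix: each row is a sample, each column a feature
X = np.array([[120.0, 3.0],
              [ 80.0, 2.0],
              [200.0, 5.0]])

# Centralization (zero averaging): subtract the per-column mean
X_centered = X - X.mean(axis=0)

# Standardization: subtract the mean, then divide by the standard deviation
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_centered.mean(axis=0))     # each column now has mean ~0
print(X_standardized.std(axis=0))  # each column now has standard deviation ~1
```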

4. Meaning

In many practical problems, the sample data are multi-dimensional, that is, each sample is described by several features. For example, when predicting housing prices, the factors (features) that affect the price include the area of the house, the number of bedrooms, and so on. Clearly these features differ in both units and numerical magnitude. If the raw values were used directly, the features would influence the prediction to different degrees simply because of their scale; standardization gives the different features a common scale. In short, when the features in different dimensions of the raw data have inconsistent scales (units), a standardization step is needed to preprocess the data.

5. Standardization (normalization) advantages and methods

5.1 Two advantages of standardization (normalization):

1) After normalization, gradient descent converges to the optimal solution faster (see the sketch after this list);

2) Normalization helps to improve accuracy.
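
A rough, self-contained illustration of the first point (made-up housing-style data and plain NumPy, not taken from the original article): with features on very different scales, batch gradient descent is limited to a tiny learning rate and is still far from the optimum after a fixed number of steps, while on standardized features it converges quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
area = rng.uniform(50, 250, size=200)    # square metres: large scale
rooms = rng.uniform(1, 5, size=200)      # bedrooms: small scale
y = 3.0 * area + 40.0 * rooms + rng.normal(0, 5, size=200)

def run_gd(X, y, lr, n_iter=2000):
    """Plain batch gradient descent for least squares; returns the final MSE."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return np.mean((X @ w - y) ** 2)

# Center y and X so that only the feature scales differ between the two runs
yc = y - y.mean()
Xc = np.column_stack([area, rooms]) - [area.mean(), rooms.mean()]
Xs = Xc / Xc.std(axis=0)                 # centered *and* scaled

# Unscaled: the large "area" variance forces a tiny learning rate,
# and the "rooms" weight is still far from converged after 2000 steps
print("unscaled features:", run_gd(Xc, yc, lr=4e-4))
# Scaled: a much larger learning rate works and the fit reaches the noise floor
print("scaled features:  ", run_gd(Xs, yc, lr=0.5))
```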

5.2 Two methods of standardization (normalization):

1) Min-max normalization, also known as dispersion standardization, is a linear transformation of the original data that maps the result into the interval [0, 1].

The conversion function is as follows:

x' = (x - min) / (max - min)

where max is the maximum value of the sample data and min is the minimum value. One drawback of this method is that when new data are added, max and min may change and have to be recomputed.
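
A minimal sketch of this transform (plain NumPy on an illustrative array; scikit-learn's MinMaxScaler provides the same mapping):

```python
import numpy as np

def min_max_normalize(x):
    """Linearly map values into [0, 1] using the sample min and max."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:                  # all values identical: nothing to scale
        return np.zeros_like(x)
    return (x - x.min()) / span

area = np.array([80.0, 120.0, 150.0, 200.0])
print(min_max_normalize(area))     # [0.  0.333  0.583  1.] (approximately)
```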

2) The Z-score standardization (zero-mean standardization) method uses the mean and standard deviation of the original data to standardize it. The processed data has a mean of 0 and a standard deviation of 1 (and follows the standard normal distribution if the original data were normally distributed).

The conversion function is as follows:

x' = (x - μ) / σ

where μ is the mean of the sample data and σ is its standard deviation.
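
A minimal sketch (assuming scikit-learn is available; its StandardScaler applies this per-column transform and stores the mean and standard deviation so new data can be scaled consistently):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[120.0, 3.0],
              [ 80.0, 2.0],
              [200.0, 5.0]])

scaler = StandardScaler()              # (x - mean) / std, column by column
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))           # ~0 for every column
print(X_scaled.std(axis=0))            # ~1 for every column
print(scaler.mean_, scaler.scale_)     # stored statistics, reusable on new data
```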

6. Usage scenarios of the normalization methods

In classification and clustering algorithms, when distances are used to measure similarity, or when PCA is used for dimensionality reduction, the Z-score standardization method performs better.

The min-max standardization method can be used when distance measures and covariance calculations are not involved and the data do not follow a normal distribution.

Reason: with the min-max standardization method, the covariance is merely rescaled by a constant factor, so the method cannot eliminate the influence of the dimensions on the variance and covariance, and this strongly affects PCA; at the same time, because the dimensions remain, distance calculations carried out in different units give different results.

With the Z-score standardization method, the new data has been rescaled by the standard deviation, so the dimensions are effectively equivalent: each one has a mean of 0 and a variance of 1. When computing distances, the units of each dimension have been removed, which avoids the otherwise large influence that the choice of units in different dimensions would have on the distance calculation.
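
A small illustration of the distance argument (made-up numbers, plain NumPy): on the raw data the Euclidean distance is dominated by the large-scale "area" feature, while after Z-score standardization both features contribute on a comparable footing.

```python
import numpy as np

# Made-up samples: [house area in m^2, number of bedrooms]
X = np.array([[ 80.0, 2.0],
              [ 82.0, 5.0],
              [200.0, 2.0]])

def euclid(a, b):
    return np.linalg.norm(a - b)

# Raw data: the 3-bedroom difference barely registers next to the area values
print(euclid(X[0], X[1]), euclid(X[0], X[2]))   # ~3.6 vs 120.0

# After Z-score standardization both feature differences count comparably
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(euclid(Z[0], Z[1]), euclid(Z[0], Z[2]))   # ~2.1 vs ~2.1
```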
