Centralization, standardization, and normalization

Centralization (Zero-centered or Mean-subtraction)

That is, $x' = x - \mu$, which yields data with a mean of 0; centering is one of the steps of standardization.


Effects:

  1. For the covariance matrix in PCA, centering simplifies the calculation of the covariance matrix and has no effect on the result (see the sketch after this list).
  2. It lets the model ignore the bias term and focus only on the weights.
  3. It increases the orthogonality of the basis vectors.
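As a quick illustration, here is a minimal sketch using NumPy and made-up data (neither appears in the original post): centering each column gives a mean of 0, and the covariance matrix is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))   # raw data whose mean is far from 0

X_centered = X - X.mean(axis=0)                      # x' = x - mu, applied per feature (column)

print(X_centered.mean(axis=0))                       # approximately [0, 0, 0]

# Effect 1: centering does not change the covariance matrix,
# because the covariance computation already subtracts the mean.
print(np.allclose(np.cov(X, rowvar=False), np.cov(X_centered, rowvar=False)))   # True
```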

Standardization

$x' = \frac{x - \mu}{\sigma}$
This yields data with a mean of 0 and a standard deviation of 1, and the transformation is linear. If $X$ was originally normally distributed, $N(\mu, \sigma^2)$, then the standardized variable $Z$ follows the standard normal distribution $N(0, 1)$. If $X$ is not normally distributed, $Z$ will not be normally distributed either.
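A minimal sketch of z-score standardization with NumPy (the data here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=4.0, size=1000)   # roughly N(10, 16)

z = (x - x.mean()) / x.std()                     # x' = (x - mu) / sigma

print(z.mean(), z.std())                         # ~0.0 and ~1.0
```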


Effects:

  1. Data standardization (normalization) is a basic step in data mining. Different evaluation indicators often have different units and scales, which affects the results of data analysis. To eliminate the influence of units across indicators, the data must be standardized so that the indicators become comparable. After standardization, all indicators are on the same order of magnitude and are suitable for a comprehensive comparative evaluation.

  2. It speeds up gradient descent's search for the optimal solution and accelerates the convergence of the weight parameters.
    For example, suppose x1 ranges over 0-2000 and x2 over 1-5, and these are the only two features. The loss surface then forms a long, narrow ellipse, and gradient descent zigzags back and forth across the contours, making the iteration very slow. After scaling, the contours are closer to circles and the iteration converges quickly (intuitively: each step moves in roughly the right direction instead of wandering).

  3. It can improve accuracy. Some classifiers (KNN, SVM, deep learning models) compute distances between samples (e.g. the Euclidean distance). If one feature has a much larger value range than the others, the distance is dominated by that feature, which may contradict the actual situation (for example, the feature with the small value range may in fact be the more important one); see the sketch below.
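To make point 3 concrete, here is a small sketch (NumPy, made-up numbers) of how a feature ranging over 0-2000 drowns out a feature ranging over 1-5 in the Euclidean distance until the features are standardized:

```python
import numpy as np

# Two samples; x1 ranges over roughly 0-2000, x2 over 1-5.
A = np.array([100.0, 1.0])
B = np.array([120.0, 5.0])

print(np.linalg.norm(A - B))      # ~20.4: the distance is driven almost entirely by x1

# Standardize each feature using statistics from a small (made-up) dataset.
X = np.array([[100.0, 1.0], [120.0, 5.0], [1500.0, 3.0], [300.0, 2.0]])
mu, sigma = X.mean(axis=0), X.std(axis=0)
A_s, B_s = (A - mu) / sigma, (B - mu) / sigma

print(np.linalg.norm(A_s - B_s))  # now the large relative change in x2 is what matters
```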

Normalization

$x^{*} = \frac{x - \min}{\max - \min}$
This maps the values into the interval [0, 1].
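A minimal min-max normalization sketch with NumPy (illustrative values only):

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0, 20.0])

x_norm = (x - x.min()) / (x.max() - x.min())   # x* = (x - min) / (max - min)

print(x_norm)   # [0.  0.1666...  0.3888...  0.6666...  1.]
```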

Standardization and normalization

Similarities:

Both standardization and normalization are essentially linear transformations of the data.

Differences:

  1. Normalization strictly limits the range of the transformed data. For example, min-max normalization, computed from the maximum and minimum values, keeps the result strictly within [0, 1].
    Standardization has no such strict interval: the transformed data has no fixed range, only a mean of 0 and a standard deviation of 1.
  2. For normalization, the scaling depends only on the extreme values. For example, given 100 numbers, if every value other than the maximum and minimum is replaced, the scaling factor stays the same. For standardization, replacing values other than the maximum and minimum will generally change the mean and standard deviation, and the scaling factor changes with them (a small sketch follows).
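A small sketch of difference 2 (NumPy, made-up numbers): the two arrays share the same maximum and minimum but differ in the middle values, so min-max scaling uses the same factor for both, while the mean and standard deviation used by standardization change.

```python
import numpy as np

a = np.array([1.0, 4.0, 6.0, 9.0, 10.0])
b = np.array([1.0, 2.0, 3.0, 8.0, 10.0])   # same min and max, middle values replaced

# Min-max normalization: the scaling factor 1 / (max - min) is identical for both.
print(a.max() - a.min(), b.max() - b.min())   # 9.0 9.0

# Standardization: the mean and standard deviation differ, so the scaling changes.
print(a.mean(), a.std())   # 6.0  ~3.29
print(b.mean(), b.std())   # 4.8  ~3.54
```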

Preconditions for use:

  1. When the scales (units) of different feature dimensions in the raw data are inconsistent, the data should be standardized or normalized; otherwise, no standardization is needed.
  2. Not all models need scaling. Ask whether the algorithm measures distances or standard deviations between variables. A decision tree, for example, involves nothing distance-related, so the variables usually do not need to be standardized when building a decision-tree model. Likewise, probabilistic models do not need normalization, because they care not about the values of the variables but about their distributions and the conditional probabilities between them.

Usage guidelines

  1. If the processed data must lie within a strict range, use normalization.
  2. When distance measures and covariance calculations are not involved, normalization can be used.
  3. Standardization is the more common method in ML; if you have no particular preference, use standardization.
  4. If the data are unstable, with extreme maximum and minimum values, do not use normalization.
  5. In classification and clustering algorithms, when distance is used to measure similarity, or when PCA is used for dimensionality reduction, standardization performs better; see the sketch below.
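As a sketch of guideline 5, the following assumes scikit-learn is available (the library is not mentioned in the post) and standardizes the features before PCA so that scale alone cannot dominate the components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# StandardScaler applies (x - mu) / sigma per feature before PCA sees the data,
# so no feature dominates the principal components merely because of its scale.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
X_reduced = pipe.fit_transform(X)

print(X_reduced.shape)   # (150, 2)
```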

Small note

The two Chinese terms translated as "standardization" and "normalization" cover the following four feature-scaling methods:
Rescaling: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
Mean normalization: $x' = \frac{x - \mathrm{mean}(x)}{\max(x) - \min(x)}$
Standardization: $x' = \frac{x - \overline{x}}{\sigma}$
Scaling to unit length: $x' = \frac{x}{\lVert x \rVert}$
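For reference, here is one possible NumPy rendering of the four formulas above (a sketch; the helper names and example data are my own):

```python
import numpy as np

def rescaling(X):                 # x' = (x - min) / (max - min), per column
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def mean_normalization(X):        # x' = (x - mean) / (max - min), per column
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def standardization(X):           # x' = (x - mean) / std, per column
    return (X - X.mean(axis=0)) / X.std(axis=0)

def scaling_to_unit_length(X):    # x' = x / ||x||, per row (sample) vector
    return X / np.linalg.norm(X, axis=1, keepdims=True)

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 800.0]])
print(rescaling(X))
print(standardization(X))
```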

P.S.: I don't think normalization is the only transformation that turns the ellipse into a circle...

References:
https://blog.csdn.net/weixin_36604953/article/details/102652160
https://www.zhihu.com/question/20467170

Original post: https://blog.csdn.net/weixin_42764932/article/details/112676203