Normalization, Standardization, and Zero-centering

Table of contents

1. Concept
   1. Normalization
   2. Standardization
   3. Centralization/zero-centered

2. Connections and differences

3. Various ways of standardization and normalization

4. Why normalize/standardize?
   4.1. The data have different dimensions (units); the orders of magnitude differ greatly
   4.2. Avoid numerical problems: numbers that are too large can cause numerical problems
   4.3. Balance the contribution of each feature
   4.4. The needs of some model solvers: speed up gradient descent toward the optimal solution

5. When to use normalization? When to use standardization?
   5.1. Usage scenarios for normalization and standardization
   5.2. Should all situations be standardized or normalized?

6. Why neural networks need to be normalized
   6.1. Numerical issues
   6.2. Solver requirements


1. Concept

1. Normalization:

(1) Rescale a column of data into a fixed interval (range), usually [0, 1] or (-1, 1). This is done mainly for convenience of data processing: mapping the data into the 0-to-1 range makes it simpler and faster to handle.

(2) Turn dimensioned (unit-bearing) expressions into dimensionless ones, so that indicators with different units or magnitudes can be compared and weighted. Normalization is a way of simplifying computation: a dimensioned expression is transformed into a dimensionless pure number.

From the formula x' = (x − min) / (max − min), the normalized output lies in the range [0, 1].

This method scales the original data proportionally: using the variable's maximum and minimum values, it converts the original data into values within a fixed range, thereby eliminating the influence of units and orders of magnitude and changing the variables' weights in the analysis, which solves the problem of incomparable metrics. However, because this extreme-value method depends only on the variable's maximum and minimum and on nothing else, it relies too heavily on these two extreme values when reweighting each variable.

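As a minimal sketch (the feature matrix X below is made up purely for illustration), min-max normalization in NumPy looks like this:

# Min-max normalization (sketch; X is an illustrative feature matrix)
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 800.0],
              [3.0, 500.0]])

X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)   # every column is now scaled into [0, 1]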
 2. Standardization:

Subtract the mean from the original data, then divide by the standard deviation. This transforms the data into a distribution with mean 0 and standard deviation 1. Remember: the result is not necessarily a normal distribution.

Although this method uses all of the data in the process of removing dimensions, it gives every converted variable not only the same mean but also the same standard deviation. That is, while removing dimensions it also removes the differences in how much the variables vary, so after the conversion every variable is treated as equally important in, for example, cluster analysis.

From the formula z = (x − μ) / σ, the standardized output ranges from negative infinity to positive infinity.

# Python implementation
import numpy as np

X -= np.mean(X, axis=0)  # zero-center: subtract the per-feature mean
X /= np.std(X, axis=0)   # scale each feature to unit standard deviation

In machine learning we deal with many kinds of data, for example audio signals and the pixel values of images, and such data is often high-dimensional. After standardization, every feature has mean 0 (each value has the feature's original mean subtracted from it) and standard deviation 1. This preprocessing is widely used in many machine learning algorithms (for example support vector machines, logistic regression, and neural networks).

(Figure: the left panel shows the original data, the middle panel the data after subtracting the mean, and the right panel the data after also dividing by the standard deviation.)
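For reference, scikit-learn provides the same standardization; a minimal sketch, assuming X is an illustrative 2-D NumPy feature matrix:

# Standardization via scikit-learn (sketch)
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 800.0], [3.0, 500.0]])
X_std = StandardScaler().fit_transform(X)   # each column: mean 0, standard deviation 1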

3. Centralization/zero-centered:

Centralization, also called zero-mean processing, subtracts the mean of the data from each original value. The centered data therefore has mean 0; nothing is required of its standard deviation.
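A minimal sketch of zero-centering alone (X is again an illustrative NumPy feature matrix):

# Zero-centering only (sketch)
import numpy as np

X = np.array([[1.0, 200.0], [2.0, 800.0], [3.0, 500.0]])
X_centered = X - X.mean(axis=0)   # per-column mean becomes 0; the spread is left unchanged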

2. Connections and differences:

The difference between normalization and standardization:

  • Normalization converts the feature values of the samples to the same scale, mapping the data into the interval [0, 1] or [-1, 1]. The scaling is determined only by the variable's two extreme values, the maximum and the minimum.
  • Standardization is related to the overall sample distribution. Compared with normalization, every sample point influences the standardization (through the mean and the standard deviation).
  • Obviously, normalization strictly limits the range of the transformed data; for example, with the min-max procedure above the result lies strictly in [0, 1]. Standardization has no strict interval: the transformed data has no fixed range, only a mean of 0 and a standard deviation of 1, so its values can run from negative infinity to positive infinity.

Similarities between normalization and standardization:

  • What they have in common: both remove the error caused by differing dimensions (units), and both are in essence linear transformations; they do not change the ordering of the original data, and they scale and translate the vector X proportionally.

The difference between standardization and centralization:

  • Standardization is the raw score minus the mean, then divided by the standard deviation; centering is just the raw score minus the mean. The usual procedure is therefore to center first and then divide by the standard deviation.

Both achieve dimensionlessness: my understanding is that the units involved in the actual problem can be removed by such a method, which simplifies the calculation.

Summary:

Concept
  • Normalization: rescales the values into the interval (0, 1) or (-1, 1).
  • Standardization: rescales the corresponding data distribution to mean 0 and standard deviation 1.

Focus
  • Normalization: normalizes the values themselves; the distribution information of the data is lost and the distances between samples are not well preserved, but the relative weights are kept.
  • Standardization: normalizes the data distribution; the relationships (distances) between samples are better preserved, but the weight information is lost.

Formula
  • Normalization: x' = (x − min) / (max − min)
  • Standardization: z = (x − μ) / σ

Shortcomings
  • Normalization: (1) the distance information between samples is lost; (2) poor robustness: when new samples are added, the maximum and minimum are easily affected by outliers.
  • Standardization: (1) the weight information between samples is lost.

Suitable scenarios
  • Normalization: (1) small or fixed data sets; (2) when distance measures and covariance calculations are not involved and the data does not follow a normal distribution; (3) multi-indicator comprehensive evaluation.
  • Standardization: (1) classification and clustering algorithms that measure similarity by distance; (2) when good robustness is needed, when there are outliers in the data, or when the maximum and minimum are unknown.

Scaling method
  • Normalization: translate by the minimum, then scale by the max-min range.
  • Standardization: translate by the mean μ, then scale by the standard deviation σ.

Purpose
  • Normalization: removes dimensions (units) so that different indicators can be combined in a comprehensive evaluation.
  • Standardization: convenient for subsequent gradient descent and activation functions, because the standardized data is centered around 0, and sigmoid, tanh, softmax are also centered around 0.

3. Various ways of standardization and normalization

Broadly speaking, standardization and normalization are both linear transformations of the data, so we do not have to insist that normalization map the data exactly into [0, 1].

1. Min-max normalization, the most common form of normalization, also called linear normalization: x' = (x − min) / (max − min)

2. Mean normalization: x' = (x − mean) / (max − min)

3. Standardization (Z-score normalization), also called standard-deviation standardization or zero-mean standardization: z = (x − μ) / σ
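For concreteness, here is a small NumPy sketch of the three scalings applied to one illustrative feature column:

# The three scalings on a single feature column (illustrative values)
import numpy as np

x = np.array([1.0, 5.0, 10.0, 50.0])

x_minmax   = (x - x.min())  / (x.max() - x.min())   # linear normalization -> values in [0, 1]
x_meannorm = (x - x.mean()) / (x.max() - x.min())   # mean normalization -> values roughly in (-1, 1)
x_zscore   = (x - x.mean()) / x.std()               # z-score standardization -> mean 0, std 1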

4. Why normalize/standardize?

As mentioned above, normalization/standardization is essentially a linear transformation. Linear transformations have many good properties, and these properties guarantee that the data does not "fail" after being transformed but instead becomes easier to work with; they are the premise of normalization/standardization. One particularly important property: a linear transformation does not change the numerical ordering of the original data.

4.1. The data have different dimensions (units); the orders of magnitude differ greatly

After standardization, the original data is converted into dimensionless index values, and every index value sits on the same scale, so they can be used together in a comprehensive evaluation and analysis.

If the original index values were used directly in the analysis, the indicators with larger numerical levels would dominate the comprehensive analysis, while the role of the indicators with smaller numerical levels would be relatively weakened.

4.2. Avoid numerical problems: numbers that are too large can cause numerical problems.

4.3. Balance the contribution of each feature

Some classifiers, such as KNN, need to compute the distance between samples (for example the Euclidean distance).

If one feature has a very large value range, the distance computation is dominated by that feature, which may contradict the actual situation (for example, the feature with the small value range may in fact be the more important one).
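A quick sketch of this effect with made-up numbers (rooms vs. area); until the features are rescaled, the Euclidean distance is driven almost entirely by the large-range area feature:

# Distance dominated by the large-range feature (sketch)
import numpy as np

a = np.array([3.0, 300.0])               # [rooms, area]
b = np.array([4.0, 600.0])

d_raw = np.linalg.norm(a - b)            # ~300.0: almost all of it comes from the area feature
# rescale both features into [0, 1], assuming ranges rooms 0~10 and area 0~1000
a_s = a / np.array([10.0, 1000.0])
b_s = b / np.array([10.0, 1000.0])
d_scaled = np.linalg.norm(a_s - b_s)     # ~0.32: both features now contribute to the distance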

4.4. The needs of some model solvers: speed up gradient descent toward the optimal solution

1) When gradient descent is used to solve the optimization problem, normalization/standardization speeds up the solution, i.e. it improves the model's convergence rate. As the left figure shows, without normalization/standardization the loss contours are elongated ellipses, and the iterations are likely to follow a "zigzag" path (perpendicular to the long axis), so many iterations are needed before convergence. As the right figure shows, once the two features are normalized the contours become nearly circular, and gradient descent converges much faster.

(Figure: left, loss contours before normalization; right, loss contours after normalization.)

Example analysis. Assume:

θ1 is the parameter for the number of rooms, whose values range over roughly 0~10;

θ2 is the parameter for the room area, whose values range over roughly 0~1000.

Without normalization, the loss for a single sample has the form J(θ) = (θ1·x1 + θ2·x2 − y)², where x1 (rooms) ranges over 0~10 and x2 (area) over 0~1000, so the coefficients multiplying θ1 and θ2 differ by about two orders of magnitude.

After the data is normalized, the loss function can be written as J(θ) = (θ1·x1' + θ2·x2' − y)², where the rescaled features x1' and x2' both lie in [0, 1].

From the normalized expression it can be seen that the coefficients in front of the two variables are of roughly the same size, so the contours of the loss are close to circles, and the optimization toward the optimal solution proceeds as shown in the figure above.

At the same time, it can also be seen that after the data is normalized the optimization path toward the optimal solution becomes noticeably smoother, and it is easier to converge correctly to the optimal solution.
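A minimal sketch of this convergence effect with plain gradient descent on a least-squares loss; the data, learning rates, and step counts below are illustrative assumptions, not taken from the article:

# Gradient descent with and without feature scaling (sketch)
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 100)             # e.g. number of rooms
x2 = rng.uniform(0, 1000, 100)           # e.g. room area
y = 3.0 * x1 + 0.05 * x2 + rng.normal(0, 1, 100)

def gradient_descent(X, y, lr, steps=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ theta - y) / len(y)
        theta -= lr * grad
    return theta

X_raw = np.column_stack([x1, x2])
X_scaled = (X_raw - X_raw.min(axis=0)) / (X_raw.max(axis=0) - X_raw.min(axis=0))

# Raw features: the learning rate must be tiny to stay stable for the large-range
# column, so progress along the small-range column is extremely slow.
theta_raw = gradient_descent(X_raw, y, lr=1e-6)
# Scaled features: the contours are rounder and a moderate learning rate converges quickly.
theta_scaled = gradient_descent(X_scaled, y, lr=0.5)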

5. When to use normalization? When to use standardization?

5.1 Usage scenarios for normalization and standardization

  1. If there is a requirement for the range of output results, use normalization.
  2. If the data is relatively stable and there are no extreme maximum and minimum values, use normalization.
  3. If there are outliers and a lot of noise in the data, use standardization; through centering it can indirectly reduce the influence of outliers and extreme values.

A Zhihu blogger shared his personal experience: generally speaking, I personally recommend standardization first. When there is a requirement on the output, try other methods, such as normalization or more complex approaches; many methods can map the output range to [0, 1]. If we have an assumption about the distribution of the data, an even more effective method is to transform the data using the corresponding probability density function.

5.2. Should all situations be standardized or normalized?

When the scales (dimensions) of the different feature dimensions of the raw data are inconsistent, a standardization or normalization step is needed; otherwise it is not required. Not all models need it. For example, if a model's algorithm involves no distance measure and no comparison of standard deviations between variables, the step can be skipped: a decision tree uses nothing related to distance, so variables usually do not need to be standardized when building a decision tree model. Likewise, probabilistic models do not need normalization, because they do not care about the values of the variables but about their distributions and the conditional probabilities between variables.

6. Why neural networks need to be normalized

6.1 Numerical issues

Normalization/standardization can avoid some unnecessary numerical problems. The magnitude of the input variables alone may cause such problems, because the useful nonlinear interval of tansig (tanh) is roughly [-1.7, 1.7]. This means that for a neuron to be effective, w1x1 + w2x2 + b inside tansig(w1x1 + w2x2 + b) should be on the order of 1.7. If the input is large, the weight must be correspondingly small; one very large number multiplied by one very small number invites numerical problems.
For example, if an input is 421, which may not seem large, the effective weight will be on the order of 1/421, say 0.00243, so that the product is about 1. In MATLAB, however, 421*0.00243 == 0.421*2.43 evaluates to false even though the two products are mathematically equal; this is a numerical (floating-point) problem.
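A small sketch of both issues, using NumPy's tanh in place of MATLAB's tansig (the specific numbers are illustrative assumptions):

# Saturation and floating-point comparison (sketch)
import numpy as np

w, b = 0.5, 0.0
x_raw = 421.0                        # unnormalized input
x_norm = x_raw / 1000.0              # assumed rescaling into a small range

print(np.tanh(w * x_raw + b))        # ~1.0: the neuron saturates and its gradient vanishes
print(np.tanh(w * x_norm + b))       # ~0.21: inside the useful nonlinear region

print(421 * 0.00243 == 0.421 * 2.43) # may be False: mathematically equal, but rounded differently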

6.2 Solver requirements

a. Initialization: during initialization we want every neuron to start in an effective state. The tansig function has good nonlinearity in roughly [-1.7, 1.7], so we want the function's input, together with the initialization of the neurons, to keep every neuron in a reasonable range at the start. (If the weights are initialized in [-1, 1] but the input is not normalized and is too large, the neurons will saturate.)

b. Gradient: take a three-layer BP network (input, hidden layer, output) as an example. The gradient of an input-to-hidden weight has the form 2·e·w·(1 − a²)·x, where e is the error, w is the hidden-to-output weight, a is the hidden neuron's activation, and x is the input. If the output layer's values are of large magnitude, e becomes very large; similarly, to map the hidden layer (whose magnitude is of order 1) onto such an output, w must also be very large; and if x is also very large, the gradient formula shows that the product of the three is enormous. This causes numerical problems when the gradient update is applied.

c. Learning rate: from b, if the gradient is very large, the learning rate must be very small, so the choice of the (initial) learning rate has to take the range of the inputs into account. It is better simply to normalize the data, so that the learning rate no longer needs to be adjusted to the data range. The hidden-to-output weight gradient can be written 2·e·a, and the input-to-hidden weight gradient 2·e·w·(1 − a²)·x; because of x and w, the two gradients differ in order of magnitude, so the learning rates they need also differ in order of magnitude. A learning rate suitable for w1 may be far too small for w2: using a rate suited to w1 makes the steps in the w2 direction very slow and wastes a lot of time, while a rate suited to w2 is too large for w1 and no solution suitable for w1 is found. With a fixed learning rate and unnormalized data, the consequences are easy to imagine.

d. Search trajectory: already explained above.

Another article:

Deep Dive: Why Feature Normalization/Standardization?


Source: blog.csdn.net/ytusdc/article/details/128504272