Data normalization/standardization


2016-08-19 09:42:40


http://blog.csdn.net/pipisorry/article/details/52247379

Here we mainly discuss common methods for normalizing continuous features. For discrete features, see [ Data Preprocessing: One-Hot Encoding ].

Basic knowledge reference:

[ Mean, variance and covariance matrices ]

[ Matrix Theory: Vector Norms and Matrix Norms ]

Normalization and standardization of data

    Normalization of data is the scaling of data so that it falls within a small, specific interval. It is often used when comparing and evaluating indicators: it removes the units of the data and converts the values into dimensionless pure numbers, so that indicators with different units or magnitudes can be compared and weighted. The most typical case is normalization proper, which maps the data uniformly into the [0,1] interval.

    At present there are many data standardization methods, which can be grouped into linear methods (such as the extreme-value method and the standard-deviation method), broken-line methods (such as the three-segment broken-line method), and curve methods (such as the half-normal distribution). Different standardization methods affect the evaluation results of a system differently, and unfortunately there is no general rule for choosing among them.

Goals of normalization

1. Convert the numbers into decimals in (0, 1)
        This is mainly for the convenience of data processing: mapping the data into the range 0 to 1 makes it more convenient and faster to handle, and belongs to the realm of digital signal processing.
2. Convert a dimensional expression into a dimensionless one
        Normalization is a way to simplify calculation: a dimensional expression is transformed into a dimensionless expression, a pure scalar. For example, a complex impedance can be normalized and written Z = R + jωL = R(1 + jωL/R), so that the complex part becomes a pure, dimensionless quantity.
In addition, in microwave engineering (circuit analysis, signal systems, electromagnetic wave transmission, and so on), many operations can be handled this way, which not only keeps the computation convenient but also highlights the essential meaning of the physical quantities.

Normalization brings the following benefits

1. Improve the convergence speed of the model

Suppose x1 takes values in 0-2000 and x2 in 1-5. With only these two features, the loss contours form a long, narrow ellipse, and gradient descent, which moves perpendicular to the contour lines, follows a zigzag route that makes the iteration very slow. After normalization the contours become nearly circular and the iteration converges much faster (intuitively, the step direction now points consistently toward the minimum, so the optimization does not go astray).
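A minimal sketch of this effect (synthetic data; the feature ranges follow the text, but the learning rates, tolerance, and iteration cap are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([rng.uniform(0, 2000, n), rng.uniform(1, 5, n)])
y = X @ np.array([0.05, 3.0]) + rng.normal(0, 0.1, n)

def gd_iterations(X, y, lr, tol=1e-6, max_iter=100000):
    # Gradient descent on the mean-squared-error loss; returns how many
    # iterations it takes for the gradient norm to drop below tol.
    w = np.zeros(X.shape[1])
    for i in range(max_iter):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        if np.linalg.norm(grad) < tol:
            return i
        w -= lr * grad
    return max_iter

# Raw features: the largest stable learning rate is tiny, and the slow
# direction crawls, so this typically hits the iteration cap.
print(gd_iterations(X, y, lr=1e-7))
# Min-max scaled features: a much larger step converges in a few hundred steps.
Xs = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(gd_iterations(Xs, y, lr=0.5))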


2. Improve the accuracy of the model

Another advantage of normalization is improved accuracy, which matters greatly for algorithms that involve distance calculations, for example those computing Euclidean distances. When the value range of x2 is much smaller than that of x1, x2's effect on the result is dwarfed by x1's, which amounts to a loss of precision. Normalization is therefore necessary: it makes each feature contribute comparably to the result.
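A small illustration with two made-up points, using the feature ranges from the text:

import numpy as np

a = np.array([1500.0, 1.0])   # [x1, x2]
b = np.array([1510.0, 5.0])   # differs slightly in x1, greatly in x2
print(np.linalg.norm(a - b))  # ~10.8: x2's large relative difference is almost invisible

# Min-max scale each feature using the ranges above (0-2000 and 1-5)
lo, hi = np.array([0.0, 1.0]), np.array([2000.0, 5.0])
a_s, b_s = (a - lo) / (hi - lo), (b - lo) / (hi - lo)
print(np.linalg.norm(a_s - b_s))  # ~1.0: now dominated by the real gap in x2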

    In a multi-indicator evaluation system, the evaluation indicators usually have different natures, dimensions, and orders of magnitude. When the levels of the indicators differ greatly, analyzing the raw values directly exaggerates the role of the indicators with larger numerical values in the comprehensive analysis and relatively weakens the role of the indicators with smaller values. Therefore, to ensure reliable results, the raw indicator data must be standardized.

    Before data analysis, we usually need to standardize the data and carry out the analysis on the standardized values. Data standardization (the indexation of statistical data) mainly involves two aspects: making the indicators act in the same direction, and making them dimensionless. The first solves the problem of indicators of different natures: directly summing indicators of different natures does not correctly reflect the combined result of the different forces, so inverse indicators must first have their direction reversed, so that all indicators act on the evaluation in the same direction; only then can they be summed to give a correct result. The second, dimensionless processing, solves the comparability of the data: after standardization, the raw data are converted into dimensionless evaluation values at the same order of magnitude, on which comprehensive evaluation and analysis can be carried out.

Empirically, normalization makes features of different dimensions numerically comparable, which can greatly improve the accuracy of a classifier.

3. In deep learning, data normalization can also prevent the model's gradients from exploding.

Which machine learning algorithms need data normalization

Models that need to be normalized:

        For some models, the optimal solution after scaling the dimensions unevenly is not equivalent to the original one, for example SVM (scaling changes the distances to the separating hyperplane and can change which points become support vectors). For such models, unless the value ranges of all dimensions are already similar, standardization must be performed to prevent the model parameters from being dominated by features with larger or smaller ranges.
        For other models, the optimal solution after uneven scaling is equivalent to the original one, for example logistic regression (the magnitudes of the weights θ themselves absorb the differing importance of the features). For such models, normalization does not change the optimal solution in theory. However, since the solution is usually found by an iterative algorithm, an objective function whose shape is too "flat" may make the iterations converge very slowly or not at all. So even for scale-invariant models it is best to normalize the data; see the sketch after this list.
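A hedged sketch of the SVM point using scikit-learn (not part of the original post; the wine dataset and the default hyperparameters are just an illustrative choice):

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# RBF-kernel SVM on raw features: wide-range features dominate the
# kernel's distance computation, which tends to hurt accuracy.
print(cross_val_score(SVC(), X, y).mean())

# The same model after z-score standardization usually scores much higher.
print(cross_val_score(make_pipeline(StandardScaler(), SVC()), X, y).mean())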

Models that do not need normalization:

        ICA does not seem to need normalization (perhaps because the standardized components would no longer be the independent components?).

       Ordinary least squares (OLS), based on squared loss, does not require normalization.

[ Linear regression and feature scaling (feature scaling) ]

Pippi blog

Common data normalization methods

The most commonly used are min-max normalization and z-score normalization.

Min-max normalization / 0-1 normalization / linear function normalization / dispersion normalization

It applies a linear transformation to the original data so that the result falls into the [0,1] interval. The transformation function is:

x* = (x - min) / (max - min)

where max is the maximum value of the sample data and min is the minimum value.

def Normalization(x):
    # Min-max normalization: map each value linearly into [0, 1]
    return [(float(i) - min(x)) / float(max(x) - min(x)) for i in x]
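For example, with illustrative values:

Normalization([10, 20, 30, 50])   # -> [0.0, 0.25, 0.5, 1.0]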

If you want to map the data to [-1,1] instead, first apply min-max normalization and then:

x* = x* * 2 - 1

or use the mean-centered approximation:

x* = (x - x_mean) / (x_max - x_min), where x_mean is the mean of the data.

import numpy as np

def Normalization2(x):
    # Mean normalization: center on the mean, scale by the range
    return [(float(i) - np.mean(x)) / (max(x) - min(x)) for i in x]
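Comparing the exact [-1,1] mapping with the mean-centered approximation on the same illustrative values (reusing the two functions above):

x = [10, 20, 30, 50]
print([v * 2 - 1 for v in Normalization(x)])  # exact mapping: [-1.0, -0.5, 0.0, 1.0]
print(Normalization2(x))  # approximation: [-0.4375, -0.1875, 0.0625, 0.5625]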

A drawback of this method is that newly added data may change max and min, which then have to be recomputed.

z-score normalization (zero-mean normalization)

The most common standardization method is z standardization; it is also the default (and most commonly used) standardization method in SPSS.

Also called standard-deviation standardization, this method uses the mean and standard deviation of the original data to standardize it.

The processed data have mean 0 and standard deviation 1 (and if the original data are normally distributed, the result follows the standard normal distribution). The transformation function is:

x* = (x - μ) / σ

where μ is the mean of all sample data and σ is the standard deviation of all sample data.

The z-score method is suitable when the maximum and minimum values of attribute A are unknown, or when there are outliers beyond the value range. It requires that the distribution of the original data be approximately Gaussian; otherwise the standardization effect can become very poor.

The standardization formula is very simple; the steps are as follows:

  1. Compute the arithmetic mean (mathematical expectation) x̄i and the standard deviation si of each variable (indicator);
  2. Standardize:
  zij = (xij - x̄i) / si
  where zij is the standardized variable value and xij is the actual variable value;
  3. Flip the sign of inverse indicators.
  The standardized values fluctuate around 0: greater than 0 means above average, less than 0 means below average.

import numpy as np

def z_score(x, axis):
    # Standardize along `axis`: subtract the mean and divide by the
    # standard deviation, so each slice ends up with mean 0 and std 1.
    x = np.array(x).astype(float)
    xr = np.rollaxis(x, axis=axis)  # view rolled so broadcasting matches the stats
    xr -= np.mean(x, axis=axis)
    xr /= np.std(x, axis=axis)
    return x
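A quick check on illustrative data that each column ends up with mean ≈ 0 and standard deviation ≈ 1:

data = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
z = z_score(data, axis=0)
print(np.mean(z, axis=0))  # ~[0. 0.]
print(np.std(z, axis=0))   # ~[1. 1.]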

Why is the standard deviation of z-score standardized data equal to 1?

Subtracting μ only shifts the data: the mean becomes 0, while the standard deviation is unchanged.

Dividing by σ scales the standard deviation by a factor of 1/σ, so it becomes 1 (Var((x - μ)/σ) = Var(x)/σ² = 1).

The two most common method usage scenarios:

1. In classification and clustering algorithms, when distance is used to measure similarity, or when PCA is used for dimensionality reduction, the second method (z-score standardization) performs better.

2. The first method (or another normalization method) can be used when no distance measure or covariance computation is involved and the data do not follow a normal distribution. For example, in image processing, after converting an RGB image to grayscale, the values are limited to the range [0, 255].
The reason is that with the first method (a linear transformation), the covariances are merely rescaled by constant factors, so the influence of the dimensions on the variances and covariances is not eliminated, which heavily affects PCA; moreover, because the dimensions remain, distance computations using differently scaled dimensions give different results. With the second method, since the new data are standardized to unit variance, every dimension becomes commensurable: each has mean 0 and variance 1, so distance calculations are de-dimensionalized, and the choice of units no longer has a huge influence on the result. An illustrative demonstration follows below.
[ Let's talk about the normalization method in machine learning (Normalization Method) ]
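An illustrative demonstration of the covariance claim on synthetic data (the sample sizes and the 0.001 coupling are made up):

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 2000, 500)          # wide-range feature
x2 = 0.001 * x1 + rng.uniform(1, 5, 500)  # correlated, small-range feature
X = np.column_stack([x1, x2])

minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
zscore = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.cov(X, rowvar=False))       # dominated by x1's huge variance
print(np.cov(minmax, rowvar=False))  # rescaled, but the variances remain unequal
print(np.cov(zscore, rowvar=False))  # ~correlation matrix: unit diagonal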

Pippi blog

log function conversion

Normalization can also be achieved with a base-10 logarithm. The method is as follows:

Many introductions online give x* = log10(x), which is actually problematic: the result does not necessarily fall into the [0,1] interval. The value should be divided by log10(max):

x* = log10(x) / log10(max)

where max is the maximum of the sample data, and all data must be greater than or equal to 1.
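A minimal sketch of this corrected formula (the function name is ours):

import numpy as np

def log_normalization(x):
    # Requires all values >= 1, so log10(x)/log10(max) falls in [0, 1]
    x = np.asarray(x, dtype=float)
    return np.log10(x) / np.log10(x.max())

print(log_normalization([1, 10, 100, 1000]))  # [0. 0.333... 0.666... 1.]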

atan function conversion

Data normalization can also be achieved with the arctangent function:

x* = atan(x) * 2 / π

Note that if the mapped interval is to be [0, 1], the data must be greater than or equal to 0; data less than 0 are mapped into the [-1, 0) interval, so not all standardized results land in [0, 1].
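A minimal sketch (the function name is ours):

import numpy as np

def atan_normalization(x):
    # Maps non-negative data into [0, 1); negatives land in (-1, 0)
    return np.arctan(np.asarray(x, dtype=float)) * 2 / np.pi

print(atan_normalization([0, 1, 10, -1]))  # [0. 0.5 ~0.937 -0.5]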

Decimal scaling normalization

This method normalizes by moving the decimal point of the data. How many places the point moves depends on the maximum absolute value of attribute A.

The original value x of attribute A is normalized to x' by decimal scaling as:
x' = x / 10^j
where j is the smallest integer such that max(|x'|) < 1.
For example, suppose the values of A range from -986 to 917, so the maximum absolute value is 986. To normalize by decimal scaling, we divide each value by 1000 (that is, j = 3), so -986 becomes -0.986.
Note that normalization changes the original data, so the parameters of the normalization method used must be saved so that subsequent data can be normalized in the same way.
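A sketch that also returns j so the parameter can be saved for later data, as the note above suggests (the function name is ours):

import numpy as np

def decimal_scaling(x):
    # Divide by 10^j, where j is the smallest integer making every |x'| < 1
    x = np.asarray(x, dtype=float)
    j = int(np.floor(np.log10(np.abs(x).max()))) + 1
    return x / 10 ** j, j

x_scaled, j = decimal_scaling([-986, 917])
print(x_scaled, j)  # [-0.986  0.917] 3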

Logistic/Softmax Transform

[ Sigmoid/Softmax Transform ]

Fuzzy quantization method

New data = 1/2 + 1/2 * sin[π / (max - min) * (X - (max + min) / 2)], where X is the original data; this maps min to 0, the midpoint to 1/2, and max to 1.
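A sketch of this mapping, assuming the centering term is the midpoint (max + min)/2 so that min maps to 0 and max to 1 (the function name is ours):

import numpy as np

def fuzzy_normalization(x):
    # Sine curve: min -> 0, midpoint -> 0.5, max -> 1
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return 0.5 + 0.5 * np.sin(np.pi / (hi - lo) * (x - (hi + lo) / 2))

print(fuzzy_normalization([0, 5, 10]))  # [0.  0.5 1. ]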
