Commonly used data normalization methods

Reposted from: https://www.cnblogs.com/followyourheart/articles/3349899.html

Statistical indicators are the basic elements of data analysis, and comparative analysis and comprehensive analysis across variables are among the most basic and most commonly used statistical methods. When the indicators have different dimensions (units) or differ in nature, analyzing the raw data directly tends to produce unreasonable conclusions.

Why standardize data

Comparing a single indicator: suppose we analyze the variation in the weights of 3 newborn infants (5, 6, 7) and of 3 adults (150, 151, 152). On the surface, the average difference within each group is 1 kg, but concluding from this that the two groups vary in weight to the same degree is clearly inappropriate, because the two sets of weights are not on the same scale, that is, their dimensions differ.

Analyzing several indicators together: suppose we run a comprehensive evaluation or a cluster analysis on operating metrics such as the number of goods sold, sales amount, and page views. Because the indicators differ greatly in magnitude, using the raw values directly lets the indicators with large values dominate the analysis, so the indicators take part in the computation with unequal weights.

Therefore, it is often necessary to standardize the data, making each statistical indicator dimensionless, in order to eliminate the influence of differing units and of differences in the magnitude and variability of the values.

Common data normalization methods

1. Min-Max normalization (range normalization)

This method subtracts the variable's minimum from each observed value and then divides by the variable's range (max - min), so that the normalized values fall in the interval [0, 1]. The conversion function is: X' = (X - min) / (max - min), where max is the maximum value of the sample and min is the minimum value of the sample.

The method is a linear transformation of the raw data and preserves the relationships among the original values. Its drawback is that when new data is added, max or min may change, and the conversion function then has to be redefined.
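The formula and its drawback can be illustrated with a minimal Python sketch. The function name `min_max_normalize`, the sample values, and the choice to return zeros for a constant variable are assumptions made for this example, not part of the original article.

```python
def min_max_normalize(values):
    """Apply X' = (X - min) / (max - min) to every value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant variable has zero range, so map it to 0.0
        # instead of dividing by zero.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


data = [10, 20, 30, 50]
print(min_max_normalize(data))         # [0.0, 0.25, 0.5, 1.0]

# Adding a new, larger value changes max, so every earlier value is rescaled.
print(min_max_normalize(data + [90]))  # [0.0, 0.125, 0.25, 0.5, 1.0]
```

The second call shows why the conversion function has to be redefined when new data arrives: the value 50 maps to 1.0 before the new point is added and to 0.5 afterwards.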

2. Z-score normalization (standard deviation standardization / zero-mean normalization)

This method subtracts the variable's mean from each observed value and then divides by the variable's standard deviation. The standardized data has a mean of 0 and a standard deviation of 1 (it follows the standard normal distribution only if the original data is itself normally distributed). The conversion function is: X' = (X - μ) / σ, where μ is the mean of all the sample data and σ is the standard deviation of all the sample data.

The method is less sensitive to outliers than Min-Max normalization. It is very useful when the maximum and minimum of the original data are unknown, or when outliers dominate the Min-Max result. Z-score standardization is currently the most widely used standardization method.
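A minimal Python sketch of the Z-score conversion follows. The function name `z_score_normalize` and the sample data are illustrative. The text does not say whether σ is the population or the sample standard deviation; this sketch assumes the population form (`statistics.pstdev`).

```python
from statistics import fmean, pstdev


def z_score_normalize(values):
    """Apply X' = (X - mu) / sigma to every value."""
    mu = fmean(values)
    # Assumption: sigma is the population standard deviation.
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(x - mu) / sigma for x in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
z = z_score_normalize(data)
print(z)                     # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(fmean(z), pstdev(z))   # 0.0 1.0
```

Swapping `pstdev` for `statistics.stdev` would give the sample-standard-deviation variant, in which case the result's population standard deviation is slightly below 1.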

3. Decimal scaling normalization

This method normalizes the data by moving the position of the decimal point. The number of places the decimal point moves depends on the maximum absolute value of the variable. An original value x of a variable is normalized to x' by the conversion function: x' = x / (10^j), where j is the smallest integer such that max(|x'|) < 1. Suppose the values of variable X range from -986 to 917, so its maximum absolute value is 986. To normalize with decimal scaling, we divide each value by 1000 (i.e., j = 3), so that -986 is normalized to -0.986 and 917 to 0.917.
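The worked example above can be reproduced with a short Python sketch. The function name `decimal_scaling_normalize` is illustrative, and the sketch assumes j is a non-negative integer, so data whose maximum absolute value is already below 1 is left unchanged.

```python
def decimal_scaling_normalize(values):
    """Apply x' = x / 10**j with the smallest j >= 0 such that max(|x'|) < 1."""
    max_abs = max(abs(x) for x in values)
    j = 0
    # Assumption: j starts at 0, so the decimal point only ever moves left.
    while max_abs / 10 ** j >= 1:
        j += 1
    return [x / 10 ** j for x in values], j


scaled, j = decimal_scaling_normalize([-986, 917])
print(j)       # 3
print(scaled)  # [-0.986, 0.917]
```

Because the condition is a strict inequality, a maximum absolute value of exactly 1000 needs j = 4, not j = 3.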

Origin www.cnblogs.com/caicai2019/p/11010119.html