[Machine learning notes] Data standardization methods

Data standardization

Before analyzing data, we usually need to standardize (normalize) it and then run the analysis on the standardized values. Data standardization converts statistical data into comparable index values. It mainly involves two things: same-direction processing, which makes all indicators point the same way, and dimensionless processing, which removes units and differences of scale.

There are many data standardization methods. They can be grouped into linear methods (such as the extreme value method and the standard deviation method), polyline methods (such as the three-segment polyline method), and curve methods (such as the half-normal distribution method). Different standardization methods affect the evaluation results differently, and unfortunately there is no general rule for choosing among them.


Why do we need to standardize data?

The multi-index comprehensive evaluation method combines the information from several indexes, each describing a different aspect of the object being evaluated, into one comprehensive index. That index is then used to evaluate the object as a whole and to make horizontal or vertical comparisons.

In a multi-index evaluation system, the indexes differ in nature, so they usually have different units and orders of magnitude. When their levels differ greatly, analyzing the raw values directly exaggerates the role of indexes with large values and weakens the role of indexes with small values. To keep the results reliable, the raw index data must therefore be standardized.


Data standardization methods

Same-direction processing deals with indicators of different natures. Some indicators are better when larger, and others (inverse indicators) are better when smaller, so summing them directly does not correctly reflect their combined effect. The inverse indicators must first be transformed so that all indicators act on the evaluation in the same direction, and only then are they added up to get a correct result.
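As a minimal sketch of this step, the snippet below flips an inverse indicator by reversing its sign, which is the approach mentioned later in the z-score steps. The function name `to_same_direction` and the sample data are hypothetical, and other transformations (for example, subtracting from the maximum) are also used in practice.

```python
def to_same_direction(values, inverse=False):
    """Return the values oriented so that larger always means better.

    Assumption: an inverse indicator is flipped by negating it.
    """
    if inverse:
        return [-v for v in values]
    return list(values)


# Hypothetical data: a defect rate is better when smaller (inverse indicator).
defect_rate = [2, 5, 1]
flipped = to_same_direction(defect_rate, inverse=True)
print(flipped)
```

After flipping, the item with the lowest defect rate has the largest value, so it can be summed with indicators that are better when larger.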

Dimensionless processing mainly addresses the comparability of data measured in different units and on different scales.

Commonly used standardization methods include min-max normalization, z-score normalization, and decimal scaling. After standardization, the raw data are converted into dimensionless index values, that is, all index values are on the same order of magnitude, and comprehensive evaluation and analysis can be carried out.
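Decimal scaling is not described further in these notes, so here is a rough sketch under the usual textbook definition, which is an assumption on my part: divide every value by the smallest power of 10 that brings the largest absolute value below 1. The function name `decimal_scaling` and the sample data are hypothetical.

```python
def decimal_scaling(values):
    """Scale values by 10**j, with j the smallest integer such that max(|x|) / 10**j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


data = [-991, 45, 120]
scaled, j = decimal_scaling(data)
print(j, scaled)
```

Here the largest absolute value is 991, so the values are divided by 1000 and all results fall strictly between -1 and 1.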

1. Min-max Normalization

Also called dispersion standardization, this is a linear transformation of the original data that maps the result into the interval [0,1]. The conversion function is:

x' = (x - min) / (max - min)

  • Here max is the maximum value of the sample data and min is the minimum value.
  • One disadvantage of this method is that adding new data may change max and min, in which case they must be recomputed.
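A minimal sketch of min-max normalization follows. The function names are hypothetical, and returning 0.0 when max equals min is my own assumption for handling the degenerate case. The last lines illustrate the disadvantage noted above: a new value outside the original range falls outside [0,1] until max and min are recomputed.

```python
def min_max_fit(values):
    """Return the (min, max) pair that defines the scaling."""
    return min(values), max(values)


def min_max_transform(values, lo, hi):
    """Apply x' = (x - min) / (max - min)."""
    if hi == lo:
        # Assumption: constant data is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


data = [10, 20, 30, 50]
lo, hi = min_max_fit(data)
scaled = min_max_transform(data, lo, hi)
print(scaled)

# A new value above the old max is mapped outside [0, 1].
new_scaled = min_max_transform([60], lo, hi)
print(new_scaled)
```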

2. Log function conversion

A base-10 logarithm can also be used for normalization. Many introductions online give the formula as x' = log10(x), but this is incomplete: the result does not necessarily fall in the interval [0,1]. The value should also be divided by log10(max), that is, x' = log10(x) / log10(max), where max is the maximum value of the sample data. All data must be greater than or equal to 1.
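The corrected formula can be sketched as follows. The function name `log10_normalize` is hypothetical, and the handling of the case where every value equals 1 (so that log10(max) is 0) is my own assumption.

```python
import math


def log10_normalize(values):
    """Apply x' = log10(x) / log10(max); requires every value to be >= 1."""
    if min(values) < 1:
        raise ValueError("all values must be greater than or equal to 1")
    denom = math.log10(max(values))
    if denom == 0:
        # Assumption: if every value is 1, map everything to 0.0.
        return [0.0 for _ in values]
    return [math.log10(v) / denom for v in values]


data = [1, 10, 100, 1000]
result = log10_normalize(data)
print(result)
```

The smallest allowed value, 1, maps to 0 and the maximum maps to 1, so the results stay within [0,1].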

3. Atan function conversion

The arctangent function can also be used to normalize data. Note that if the target interval is [0,1], the data must be greater than or equal to 0; data less than 0 are mapped to the interval [-1,0].
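The notes do not give the formula, so the sketch below assumes the commonly cited form x' = atan(x) * 2 / π, which is consistent with the intervals described above. The function name `atan_normalize` is hypothetical.

```python
import math


def atan_normalize(values):
    """Apply x' = atan(x) * 2 / pi (assumed form of the conversion)."""
    return [math.atan(v) * 2 / math.pi for v in values]


result = atan_normalize([-1, 0, 1, 100])
print(result)
```

Non-negative inputs land between 0 and 1, and negative inputs land between -1 and 0. Large values approach the bound without reaching it.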

4. z-score standardization (zero-mean normalization)

Not all standardization methods map the results to the [0,1] interval, however. The most common method is z-score standardization, also called standard deviation standardization; it is also the method most commonly used in SPSS.

x' = (x - mean) / sd

  • This method normalizes the data using the mean and standard deviation of the original data: the original value x of attribute A is converted to x', where mean and sd are the mean and standard deviation of A.
  • The z-score method is suitable when the maximum and minimum values of attribute A are unknown, or when there are outliers beyond the expected value range.
  • z-score standardization is the default standardization method in SPSS.
  • In Excel, z-scores can be computed step by step using AVERAGE and STDEV; the built-in STANDARDIZE function performs the final step once the mean and standard deviation are supplied. The formula itself is very simple.

The steps are as follows:
1. Find the arithmetic mean (mathematical expectation) x̄i and the standard deviation si of each variable (indicator);
2. Standardize:
zij = (xij - x̄i) / si
where zij is the standardized value and xij is the actual value of the variable;
3. Reverse the sign of the inverse indicators.
The standardized values fluctuate around 0. A value greater than 0 is above the average level, and a value less than 0 is below the average level.
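The three steps above can be sketched for a single variable as follows. The function name `z_scores` and the sample data are hypothetical. The sketch assumes the sample standard deviation (dividing by n - 1, as `statistics.stdev` does); use `statistics.pstdev` if the population standard deviation is intended.

```python
import statistics


def z_scores(values, inverse=False):
    """Standardize one variable; negate the result for an inverse indicator."""
    mean = statistics.mean(values)   # step 1: arithmetic mean
    sd = statistics.stdev(values)    # step 1: sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    scores = [(v - mean) / sd for v in values]              # step 2
    return [-s for s in scores] if inverse else scores      # step 3


sample = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(sample)
z_inverse = z_scores(sample, inverse=True)
print(z)
```

The standardized values have mean 0 and standard deviation 1, so values below the average come out negative and values above it come out positive.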

5. Normalization method

 


 

Origin blog.csdn.net/seagal890/article/details/105312351