Data normalization for machine learning basics

Reason for normalization

When training a machine learning model, a data set usually contains several different features. For example, in a soil heavy metal data set, each sample represents a sampling point and contains features such as longitude, latitude, altitude, and the contents of different heavy metals. These features are measured in very different units, so their values differ greatly in magnitude. When such a data set is used directly in experiments, features with a small numerical range are likely to have their influence on the target variable drowned out by features with a large range, which directly affects the results of the experiment.
[Figure: data before normalization]
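
As a rough illustration of this effect, the sketch below computes Euclidean distances between three hypothetical sampling points described by altitude and cadmium content. The values, the choice of cadmium, and the `euclidean` helper are invented for illustration and are not taken from the data set discussed here.

```python
import math

# Hypothetical sampling points: (altitude in m, cadmium content in mg/kg).
# These values are invented for illustration only.
a = (1200.0, 0.10)
b = (1210.0, 0.90)   # similar altitude, very different cadmium content
c = (1300.0, 0.12)   # similar cadmium content, different altitude


def euclidean(p, q):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


d_ab = euclidean(a, b)
d_ac = euclidean(a, c)

# Share of the squared a-b distance that comes from the cadmium feature.
cd_share = (a[1] - b[1]) ** 2 / d_ab ** 2

print(f"d(a, b) = {d_ab:.2f}, d(a, c) = {d_ac:.2f}")
print(f"cadmium share of squared d(a, b): {cd_share:.4f}")
```

Although the cadmium content of `b` is nine times that of `a`, the unscaled distance is determined almost entirely by altitude, which is the kind of imbalance normalization is meant to remove.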

To solve this problem, the data is usually preprocessed with a normalization method before it is used in experiments. Normalization is a basic step in machine learning and can be loosely understood as bringing different data onto a common scale. It takes two forms: one maps all data into the range 0 to 1 by a mathematical transformation so that it is easier to process, and the other converts dimensioned quantities into dimensionless ones. Since mapping all data into the range 0 to 1 is sufficient in most machine learning work, the methods introduced below focus on this form; Z-score standardization, which instead rescales data to zero mean and unit standard deviation, is also covered.
[Figure: data after normalization]

1. Maximum and minimum normalization

Maximum-minimum (min-max) normalization is the simplest method. For each feature variable, it traverses all values of that variable and records the maximum and the minimum. Each value is then mapped into the interval 0 to 1 according to its offset from the minimum, expressed as a fraction of the range between the maximum and the minimum. The calculation formula is as follows:
$$x^* = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Where x represents the original data, x_min represents the minimum value under this feature variable, x_max represents the maximum value under this feature variable, and x^* represents the normalized data.

Because this method maps values into the interval 0 to 1, and training fits the target value with an output computed from the input features and the parameters, the target feature variable should be normalized as well. The trained parameters are then optimized for normalized data. To obtain data in the original units when the trained model is used for prediction, the computed output must be denormalized. The denormalization formula for this method is as follows:
$$x = x^* \, (x_{\max} - x_{\min}) + x_{\min}$$
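
A minimal sketch of min-max normalization and its inverse for a single feature column, using only the Python standard library. The function names and the sample altitude values are hypothetical, and mapping a constant feature to 0.0 is an assumption of this sketch rather than something the method prescribes.

```python
def min_max_fit(values):
    """Return (x_min, x_max) for one feature column."""
    return min(values), max(values)


def min_max_normalize(values, x_min, x_max):
    """Map each value to [0, 1] via (x - x_min) / (x_max - x_min)."""
    span = x_max - x_min
    if span == 0:
        # Assumption: a constant feature is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]


def min_max_denormalize(scaled, x_min, x_max):
    """Invert the mapping: x = x* * (x_max - x_min) + x_min."""
    return [s * (x_max - x_min) + x_min for s in scaled]


# Hypothetical altitude values in metres, for illustration only.
altitude = [120.0, 150.0, 180.0, 240.0]
x_min, x_max = min_max_fit(altitude)
scaled = min_max_normalize(altitude, x_min, x_max)
restored = min_max_denormalize(scaled, x_min, x_max)

print(scaled)
print(restored)
```

The minimum and maximum are computed once and reused for denormalization, so the same pair has to be kept alongside the trained model if predictions are to be converted back to the original units.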

2. Z-score standardization

The main difference from maximum-minimum normalization is that this method uses the mean and standard deviation of each feature variable rather than its maximum and minimum. After standardization, the data has a mean of 0 and a standard deviation of 1. It follows the standard normal distribution only if the original data is itself normally distributed, because the transformation is linear and does not change the shape of the distribution. The calculation formula is as follows:
$$x^* = \frac{x - \mu}{\sigma}$$

Here, μ is the mean of the values of this feature variable and σ is their standard deviation.

Similarly, when the target feature variable has been standardized for training, the computed results must be denormalized to obtain data in the original units. The corresponding denormalization formula is as follows:
$$x = x^* \, \sigma + \mu$$
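
A minimal sketch of Z-score standardization and its inverse, again with hypothetical function names and sample values. It assumes the population standard deviation (`statistics.pstdev`); the text above does not say whether the population or the sample standard deviation is intended.

```python
import statistics


def z_score_fit(values):
    """Return (mu, sigma) for one feature column.

    Assumption: sigma is the population standard deviation.
    """
    return statistics.mean(values), statistics.pstdev(values)


def z_score_normalize(values, mu, sigma):
    """Apply x* = (x - mu) / sigma to every value."""
    if sigma == 0:
        raise ValueError("constant feature: standard deviation is zero")
    return [(x - mu) / sigma for x in values]


def z_score_denormalize(scaled, mu, sigma):
    """Invert the mapping: x = x* * sigma + mu."""
    return [s * sigma + mu for s in scaled]


# Hypothetical feature values, for illustration only.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = z_score_fit(values)
scaled = z_score_normalize(values, mu, sigma)
restored = z_score_denormalize(scaled, mu, sigma)

print(mu, sigma)
print(scaled)
```

Unlike min-max normalization, the standardized values are not confined to the interval 0 to 1; they are centred on 0 and can be negative or greater than 1.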

3. Application of different methods

In addition to the two methods above, there are other normalization methods, such as sigmoid function conversion, log function conversion, and arctangent function conversion, which likewise map values into the interval 0 to 1.
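
The text does not give formulas for these conversions, so the sketch below uses commonly seen forms as assumptions: the logistic sigmoid, a base-10 logarithm divided by the logarithm of the feature maximum, and the arctangent scaled by 2/π. The function names are hypothetical, and other variants of each conversion exist.

```python
import math


def sigmoid_scale(x):
    """Logistic sigmoid 1 / (1 + e^(-x)); maps any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def log_scale(x, x_max):
    """log10(x) / log10(x_max); assumes 1 <= x <= x_max and x_max > 1."""
    return math.log10(x) / math.log10(x_max)


def arctan_scale(x):
    """atan(x) * 2 / pi; maps x >= 0 into [0, 1)."""
    return math.atan(x) * 2.0 / math.pi


print(sigmoid_scale(0.0))
print(log_scale(100.0, 1000.0))
print(arctan_scale(1.0))
```

All three are nonlinear, so unlike the two methods above they change the relative spacing of the values as well as their range.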
Because the normalization methods work differently, they suit different practical problems. For example, classification and clustering problems often use distance values to measure the similarity between samples, and in that case Z-score standardization tends to give better results. When no distance measure is involved, or when the data does not follow a normal distribution, maximum-minimum normalization is more appropriate. When the collaborative composite neural network model is used to predict soil heavy metal content, the data does not involve distance measurement, so maximum-minimum normalization is the method adopted.

Origin blog.csdn.net/weixin_42051846/article/details/130441924