An easy-to-understand explanation of data preprocessing normalization (with Python code)


1. Why do we need data preprocessing?

  • Large collections of raw data can rarely be used as-is. For example, features with very large values increase computational cost, make optimization slow to converge, and are awkward to process statistically.

  • Data that does not follow a normal distribution cannot be used with analyses that assume normality.

So, to make better use of the data, we need to standardize it.

2. Data Standardization

Making data dimensionless mainly solves the problem of comparability. There are many normalization methods; the commonly used ones are min-max normalization, Z-score standardization, and decimal scaling.

After standardization, the original data are converted into dimensionless evaluation values, i.e. every index value is on the same order of magnitude, so comprehensive evaluation and analysis can be carried out. Here we focus on the most commonly used normalization, which maps the data uniformly onto the [0, 1] interval.

1. Goals of normalization

1. Convert the data to decimals in the (0, 1) interval. This is mainly for the convenience of data processing: data mapped into the 0-to-1 range is easier and faster to work with.

2. Turn dimensional expressions into dimensionless ones, so that the data become comparable.

2. Advantages of normalization

1. After normalization, gradient descent finds the optimal solution faster. If a machine learning model is trained with gradient descent, normalization is usually essential; otherwise training may converge slowly or fail to converge at all.

2. Normalization can improve accuracy. Some classifiers compute distances between samples (such as the Euclidean distance in KNN); if one feature has a much larger range than the others, the distance is dominated by that feature, which may contradict the actual situation (for example, the feature with the smaller range may be the more important one). See the small example below.
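A tiny illustration with made-up numbers (age and income are hypothetical features, and the ranges used for scaling are assumed):

import numpy as np

# Two made-up samples: [age in years, annual income]
a = np.array([25.0, 50000.0])
b = np.array([55.0, 52000.0])

# Raw Euclidean distance is dominated by the income feature
print(np.linalg.norm(a - b))  # ~2000.2: the 30-year age gap barely matters

# After min-max scaling each feature to [0, 1] with assumed ranges
# (age 18-60, income 0-200000), both features contribute on a comparable scale
a_scaled = np.array([(25.0 - 18) / 42, 50000.0 / 200000])
b_scaled = np.array([(55.0 - 18) / 42, 52000.0 / 200000])
print(np.linalg.norm(a_scaled - b_scaled))  # ~0.71: now the age difference dominates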

3. Which algorithms do not require normalization

Probabilistic and tree-based models do not need normalization, because they do not care about the absolute values of variables but about their distribution and the conditional probabilities between variables; examples are decision trees and random forests (RF). Optimization- and distance-based methods such as AdaBoost, SVM, logistic regression (LR), KNN and K-Means do require normalization.
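For the algorithms that do need it, scaling is usually chained in front of the model. A minimal sketch with scikit-learn, using synthetic data purely for illustration:

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Scale first, then fit KNN; the scaler's statistics come from the data passed to fit()
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)
print(model.score(X, y))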

3. Data normalization methods

1. Min-max normalization

Traverse each value in the feature vector, record the maximum (Max) and minimum (Min), and normalize the data using Max - Min as the base, so that Min maps to 0 and Max maps to 1: x' = (x - Min) / (Max - Min), where Max is the maximum of the sample data and Min is the minimum.

def MaxMinNormalization(x, Max, Min):
    # Map x into [0, 1]: Min maps to 0, Max maps to 1
    x = (x - Min) / (Max - Min)
    return x

np.max() and np.min() in numpy can be used to find the maximum and minimum. This normalization method is suited to cases where values are compared on a bounded scale, but it has a flaw: if Max and Min are unstable (for example, when new samples arrive), the normalized results and everything built on them become unstable too. In practice, Max and Min can be replaced by empirically chosen constants.
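A quick usage sketch (the sample values are made up):

import numpy as np

data = np.array([10.0, 20.0, 35.0, 50.0])  # made-up sample values
normalized = MaxMinNormalization(data, np.max(data), np.min(data))
print(normalized)  # [0.    0.25  0.625 1.   ]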

ps: To normalize data to an arbitrary interval [a, b]:

(1) Find the minimum Min and maximum Max of the original sample data X
(2) Compute the coefficient: k = (b - a) / (Max - Min)
(3) The data normalized to the [a, b] interval is then: Y = a + k(X - Min), equivalently Y = b + k(X - Max)
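A direct translation of these steps into a small helper (the function name is just for illustration):

import numpy as np

def min_max_to_interval(x, a, b):
    # Normalize x to [a, b]: Y = a + k * (X - Min), with k = (b - a) / (Max - Min)
    x_min, x_max = np.min(x), np.max(x)
    k = (b - a) / (x_max - x_min)
    return a + k * (x - x_min)

data = np.array([10.0, 20.0, 35.0, 50.0])
print(min_max_to_interval(data, -1.0, 1.0))  # [-1.   -0.5   0.25  1.  ]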

2. Z-score normalization

The most common standardization method is Z standardization (z-score standardization); it is also the default and most commonly used standardization method in SPSS.

Also called standard-deviation standardization, this method uses the mean and standard deviation of the original data to standardize it: x' = (x - μ) / σ.

def Z_ScoreNormalization(x, mu, sigma):
    # Standardize x: subtract the mean mu, divide by the standard deviation sigma
    x = (x - mu) / sigma
    return x

The mean and std functions in numpy, or the StandardScaler class provided by sklearn, can both supply the mean and standard deviation. The standardized values fluctuate around 0: a value greater than 0 is above the mean, and a value less than 0 is below it.

The following uses numpy to apply standard-deviation standardization to a matrix:

import numpy as np

x_np = np.array([[1.5, -1., 2.],
                 [2., 0., 0.]])
mean = np.mean(x_np, axis=0)
std = np.std(x_np, axis=0)
print('The initial matrix is: {}'.format(x_np))
print('The mean of the matrix is: {}\nThe standard deviation of the matrix is: {}'.format(mean, std))
another_trans_data = x_np - mean
another_trans_data = another_trans_data / std
print('The standardized matrix is: {}'.format(another_trans_data))

The initial matrix is: [[ 1.5 -1.   2. ]
 [ 2.   0.   0. ]]
The mean of the matrix is: [ 1.75 -0.5   1.  ]
The standard deviation of the matrix is: [0.25 0.5  1.  ]
The standardized matrix is: [[-1. -1.  1.]
 [ 1.  1. -1.]]

The StandardScaler class provided by sklearn is used below:

from sklearn.preprocessing import StandardScaler  # standardization tool
import numpy as np

x_np = np.array([[1.5, -1., 2.],
                 [2., 0., 0.]])
scaler = StandardScaler()
x_train = scaler.fit_transform(x_np)
print('The initial matrix is: {}'.format(x_np))
print('The mean of the matrix is: {}\nThe standard deviation of the matrix is: {}'.format(scaler.mean_, np.sqrt(scaler.var_)))
print('The standardized matrix is: {}'.format(x_train))

The initial matrix is: [[ 1.5 -1.   2. ]
 [ 2.   0.   0. ]]
The mean of the matrix is: [ 1.75 -0.5   1.  ]
The standard deviation of the matrix is: [0.25 0.5  1.  ]
The standardized matrix is: [[-1. -1.  1.]
 [ 1.  1. -1.]]

Note that after fitting, sklearn's standardization tool exposes two attributes: mean_ (the per-feature mean) and var_ (the per-feature variance). The final result is the same as the numpy version.
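A practical point the snippet above does not show: the fitted scaler stores the training statistics, so it can standardize new data with the same mean and variance, and also undo the transform. A small sketch reusing the objects defined above (the new sample is made up):

x_new = np.array([[1.0, 1.0, 1.0]])  # made-up new sample
print(scaler.transform(x_new))  # standardized with the fitted mean_ and var_: [[-3.  3.  0.]]
print(scaler.inverse_transform(x_train))  # recovers the original x_np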

Why is the standard deviation of the z-score normalized data 1?

Subtracting μ only shifts the data, so the mean becomes 0 while the spread is unchanged; dividing by σ then scales the standard deviation by a factor of 1/σ, so it becomes σ/σ = 1.
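This is easy to check numerically with any made-up array:

x = np.array([3.0, 7.0, 8.0, 10.0])  # made-up values
z = (x - np.mean(x)) / np.std(x)
print(np.mean(z), np.std(z))  # approximately 0.0 and 1.0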

3. Sigmoid function

The sigmoid function has an S-shaped curve and is a good threshold function. It is point-symmetric about (0, 0.5), has its steepest slope around that point, and as the input tends to positive or negative infinity the mapped value tends to 1 or 0 respectively. By shifting the formula, the threshold can be changed; here, used as a normalization method, we only consider the case where (0, 0.5) is the threshold: σ(z) = 1 / (1 + e^(-z)).

from matplotlib import pyplot as plt
import numpy as np
import math


def sigmoid_function(z):
    # Apply the sigmoid 1 / (1 + e^(-z)) to each element of z
    fz = []
    for num in z:
        fz.append(1 / (1 + math.exp(-num)))
    return fz


if __name__ == '__main__':
    z = np.arange(-10, 10, 0.01)
    fz = sigmoid_function(z)
    plt.title('Sigmoid Function')
    plt.xlabel('z')
    plt.ylabel('σ(z)')
    plt.plot(z, fz)
    plt.show()
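The loop above works, but since numpy is already imported, the same mapping can be applied to a whole array at once; a minimal vectorized sketch:

def sigmoid_np(z):
    # Vectorized sigmoid: maps any real-valued array into (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid_np(np.array([-10.0, 0.0, 10.0])))  # [~0.0000454  0.5  ~0.9999546]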

Summary

The main point is to understand the concept of data standardization behind sklearn's StandardScaler in machine learning, and to use it as a stepping stone toward further topics such as the Friedman test.

Origin blog.csdn.net/m0_59596937/article/details/127181149