Standardization/normalization in machine learning

Data normalization scales data into a small, specific interval. It is often used when comparing and evaluating data. Typical methods include the extreme-value (min-max) method and the standard-deviation method.

Normalization takes two main forms: one maps values to decimals in the interval (0, 1); the other converts a dimensional expression into a dimensionless one. It is also an effective way to simplify calculations in digital signal processing.

The benefits of normalization:

1. It speeds up the gradient descent solution, i.e., it increases the convergence speed of the model.

When two features have very different ranges, such as x1 ∈ [0, 2000] and x2 ∈ [1, 5] in the left figure, the loss contours form elongated ellipses, and gradient descent is likely to follow a "zigzag" route along the long axis, requiring many iterations to converge.

The right figure shows the same two features after normalization: the contour lines become more circular, so gradient descent converges faster.

Therefore, when gradient descent is used in machine learning to find the optimal solution, normalization is necessary; otherwise the model converges slowly or sometimes fails to converge at all.
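As an illustration (a minimal sketch with made-up data and a hand-rolled solver, not from the original post), the following fits a least-squares model by batch gradient descent on two features with the ranges mentioned above. On the raw features the step size shown would make the iteration blow up; after min-max scaling the same step size converges quickly.

```python
import numpy as np

def gradient_descent(X, y, lr, steps):
    """Plain batch gradient descent for least squares; returns the weights."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 2000, 100)   # wide-range feature, as in the example above
x2 = rng.uniform(1, 5, 100)      # narrow-range feature
y = 3 * x1 + 7 * x2 + rng.normal(0, 1, 100)
X_raw = np.column_stack([x1, x2])

# Min-max scale each column to [0, 1].
X_scaled = (X_raw - X_raw.min(axis=0)) / (X_raw.max(axis=0) - X_raw.min(axis=0))

# On the raw features, lr=0.5 diverges (the Hessian's largest eigenvalue is
# on the order of 1e6, so only a tiny step size is stable); on the scaled
# features the same step size converges in a modest number of iterations.
w = gradient_descent(X_scaled, y, lr=0.5, steps=2000)
mse = np.mean((X_scaled @ w - y) ** 2)
```

The point is not the specific solver but the conditioning: scaling makes the contours rounder, so a single learning rate works well in every direction.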


2. It can improve the accuracy of the model.

Some classifiers need to compute distances between samples. If one feature has a much larger value range than the others, the distance calculation will be dominated by that feature, which may not reflect the actual situation.
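A small numeric sketch (illustrative values of my own, not from the original post) makes this concrete: with one wide-range feature, the Euclidean distance between two samples is determined almost entirely by that feature, while min-max scaling restores the influence of the other.

```python
import numpy as np

# Two samples: feature 1 spans [0, 2000], feature 2 spans [1, 5].
a = np.array([1500.0, 1.0])
b = np.array([1510.0, 5.0])

# Raw distance: a 10-unit gap in feature 1 (0.5% of its range) swamps a
# 4-unit gap in feature 2 (100% of its range).
d_raw = np.linalg.norm(a - b)

# After min-max scaling with the known ranges, both features contribute
# in proportion to how far apart the samples are within each range.
lo = np.array([0.0, 1.0])
hi = np.array([2000.0, 5.0])
a_s = (a - lo) / (hi - lo)
b_s = (b - lo) / (hi - lo)
d_scaled = np.linalg.norm(a_s - b_s)
```

Before scaling, feature 1 contributes 100 of the 116 units of squared distance; after scaling, feature 2 (which differs across its whole range) dominates, as it should.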

Types of normalization/standardization and what they mean:

Linear normalization (min-max normalization)

Min-max normalization rescales each value as x' = (x − min) / (max − min). It is suitable when the values are relatively concentrated. Its disadvantage is that if max and min are unstable (for example, sensitive to outliers or new data), the normalized results are unstable too, which destabilizes downstream results. In practice, empirical constants can be used in place of max and min.

# Using the scikit-learn scaler
from sklearn import preprocessing
import numpy as np

min_max_scaler = preprocessing.MinMaxScaler()
feature_scaled = min_max_scaler.fit_transform(feature)

# Custom NumPy implementation
def min_max_norm(x):
    x = np.array(x)
    x_norm = (x - np.min(x)) / (np.max(x) - np.min(x))
    return x_norm


Standard deviation standardization (z-score standardization)

The processed data has a mean of 0 and a standard deviation of 1, via x' = (x − μ) / σ. (Note that z-scoring only standardizes the mean and variance; the data follows the standard normal distribution only if it was normally distributed to begin with.)

# Using the scikit-learn scaler
from sklearn import preprocessing
import numpy as np

standard_scaler = preprocessing.StandardScaler()
feature_scaled = standard_scaler.fit_transform(feature)

# Custom NumPy implementation
def z_score_norm(x):
    x = np.array(x)
    x_norm = (x - np.mean(x)) / np.std(x)
    return x_norm


Non-linear normalization

It is often used when the data spans a wide range, with some values very large and others very small. The original values are mapped through a mathematical function such as the logarithm, exponential, or arctangent. The curve of the non-linear function should be chosen according to the data distribution.

Log function: x' = lg(x) / lg(max); arctangent function: x' = arctan(x) · 2/π
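The two mappings above can be sketched as follows (function names are my own; note that the log form assumes positive inputs, and values below 1 map to negative outputs):

```python
import numpy as np

def log_norm(x):
    """Map positive values into (0, 1] via lg(x) / lg(max)."""
    x = np.asarray(x, dtype=float)
    return np.log10(x) / np.log10(np.max(x))

def atan_norm(x):
    """Map any real values into (-1, 1) via arctan(x) * 2 / pi."""
    x = np.asarray(x, dtype=float)
    return np.arctan(x) * 2 / np.pi

# Values spanning three orders of magnitude compress to evenly spaced
# points under the log mapping: [0, 1/3, 2/3, 1].
data = np.array([1.0, 10.0, 100.0, 1000.0])
log_scaled = log_norm(data)
atan_scaled = atan_norm(data)
```

The log mapping suits data that is spread across orders of magnitude, while the arctangent mapping bounds arbitrarily large values without needing a known maximum.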

Reprinted from: https://blog.csdn.net/index20001/article/details/78044971


Origin blog.csdn.net/xiezhen_zheng/article/details/84560144