Machine learning: normalization

This article explores why some machine learning models need the data to be normalized.

The explanation given on Wikipedia:
1) After normalization, gradient descent converges to the optimal solution faster;
2) Normalization may improve accuracy.

The following sections briefly explain these two points.

1. Why does normalization speed up finding the optimal solution with gradient descent?

The Stanford machine learning video has a good explanation: https://class.coursera.org/ml-003/lecture/21
[Figure: loss contours for unnormalized (left) and normalized (right) features]

As shown in the figure, the blue circles represent the contour lines of the loss over the two features. In the left plot, the ranges of the two features $X_1$ and $X_2$ differ greatly: $X_1$ lies in [0, 2000] while $X_2$ lies in [1, 5], so the contours are very elongated. When gradient descent is used to find the optimal solution, it is very likely to take a zigzag route (stepping perpendicular to the contours), which means many iterations are needed to converge.
In the right plot the two original features have been normalized, and the corresponding contours are nearly circular, so gradient descent converges much faster.
Therefore, if a machine learning model is trained with gradient descent, normalization is usually necessary; otherwise convergence is slow or even impossible. For example, when I used a U-Net for weather-forecast model fusion, if I did not normalize the training target (two-meter temperature) and trained directly on the raw values of roughly [200, 300] Kelvin, the loss did not converge and eventually even reached inf.
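To make this concrete, here is a minimal sketch (my own toy illustration, not from the course material above) of batch gradient descent on a linear-regression problem with one feature in [0, 2000] and one in [1, 5]. With the raw features, any stable learning rate is tiny and the iteration budget runs out; after min-max normalization, a much larger learning rate converges in a few hundred steps.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 2000, n)   # wide-range feature, interval [0, 2000]
x2 = rng.uniform(1, 5, n)      # narrow-range feature, interval [1, 5]
y = 0.003 * x1 + 2.0 * x2 + rng.normal(0, 0.1, n)

def gradient_descent(X, y, lr, steps=10000, tol=1e-8):
    """Plain batch gradient descent on mean squared error; returns the steps used."""
    w = np.zeros(X.shape[1])
    for i in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w_new = w - lr * grad
        if np.linalg.norm(w_new - w) < tol:
            return i + 1            # converged
        w = w_new
    return steps                    # hit the iteration budget without converging

X_raw = np.column_stack([x1, x2])
X_norm = (X_raw - X_raw.min(axis=0)) / (X_raw.max(axis=0) - X_raw.min(axis=0))

# Raw features: the largest stable learning rate is tiny, so convergence is very slow.
print("raw features:       ", gradient_descent(X_raw, y, lr=1e-7), "steps")
# Normalized features: a much larger learning rate converges in far fewer steps.
print("normalized features:", gradient_descent(X_norm, y, lr=0.5), "steps")
```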

2. Normalization may improve accuracy

Some classifiers need to compute distances between samples (for example the Euclidean distance), KNN being a typical case. If one feature has a very large value range, the distance computation is dominated by that feature, which may contradict the actual situation (for example, the feature with the small value range may actually be the more important one).
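For example, here is a small sketch (with hypothetical numbers of my own) in which one feature lies in [0, 2000] and the other in [1, 5]; before scaling, the Euclidean distance is decided almost entirely by the wide-range feature, and the nearest neighbour of a point can even change after min-max scaling.

```python
import numpy as np

# Hypothetical samples: [feature in [0, 2000], feature in [1, 5]]
a = np.array([1500.0, 1.2])
b = np.array([1510.0, 4.8])
c = np.array([1900.0, 1.3])

def euclidean(p, q):
    return np.linalg.norm(p - q)

# Raw distances are dominated by the wide-range feature: a looks much closer to b.
print(euclidean(a, b), euclidean(a, c))          # ~10.6 vs ~400.0

# Min-max scale each feature to [0, 1] using its assumed range.
lo = np.array([0.0, 1.0])
hi = np.array([2000.0, 5.0])

def scaled(p):
    return (p - lo) / (hi - lo)

# After scaling, both features contribute comparably and a is actually closer to c.
print(euclidean(scaled(a), scaled(b)), euclidean(scaled(a), scaled(c)))   # ~0.90 vs ~0.20
```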

3. Types of normalization

1) Linear normalization
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

This normalization method is best suited to cases where the values are relatively concentrated. It has a flaw: if max and min are unstable, the normalized result is also unstable, which makes the downstream results unstable. In practice, empirical constants can be used in place of max and min.
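A minimal sketch of this transform (the per-column version is what scikit-learn's MinMaxScaler computes):

```python
import numpy as np

x = np.array([1.0, 5.0, 3.0, 2.0, 4.0])
x_norm = (x - x.min()) / (x.max() - x.min())   # linear (min-max) normalization
print(x_norm)   # [0.   1.   0.5  0.25 0.75]
```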

2) Standard deviation standardization
The processed data has a mean of 0 and a standard deviation of 1 (for normally distributed data, this yields the standard normal distribution). The conversion function is:
$$x^* = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean of all sample data and $\sigma$ is the standard deviation of all sample data.
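A minimal sketch of this z-score transform (scikit-learn's StandardScaler applies the same transform per feature); the temperature-like values are only an assumed example echoing the U-Net case above:

```python
import numpy as np

x = np.array([200.0, 250.0, 300.0, 275.0, 225.0])   # e.g. two-meter temperature in Kelvin
x_std = (x - x.mean()) / x.std()
print(x_std.mean(), x_std.std())   # ~0.0 and 1.0
```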

3) Non-linear normalization
Often used when the data are highly differentiated: some values are very large and some are very small. The original values are mapped through a mathematical function; common choices include log, exponential, and tangent functions.
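As a small illustration (the log-based mapping below is just one common choice, not the only one), a log transform compresses values that span several orders of magnitude:

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0, 100000.0])
x_log = np.log10(x) / np.log10(x.max())   # maps values >= 1 into [0, 1]
print(x_log)   # [0.  0.2 0.4 0.6 1. ]
```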

4. Interview questions

Q: Which machine learning algorithms do not need to be normalized?
A: Tree models do not need normalization. More broadly, probabilistic models do not need normalization, because they do not care about the values of the variables but about their distribution and the conditional probabilities between variables; decision trees and random forests are examples. However, models that are solved by optimization or rely on distance computations, such as SVM, LR, KNN, and K-Means, do need normalization.
The reasons why tree models do not need normalization are:

  • Scaling the values does not change the position of the split points. When a tree node is split, the samples are sorted by feature value; if the sort order is unchanged, the split points are the same (see the sketch after this list). For a linear model such as LR, however, if one feature lies in (0, 1) and another in (0, 10000), the loss contours are elliptical when using gradient descent, and iterating to the optimum takes many steps; after normalization the contours become circular and gradient descent reaches the optimum in far fewer iterations.
  • Also note that tree models cannot be trained by gradient descent: a tree's prediction function is step-like (piecewise constant), so it is not differentiable at the step points and derivatives are meaningless. Tree models (regression trees) instead find the optimum by searching for the best split point.
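A minimal sketch (assuming scikit-learn) that checks the first point empirically: because min-max scaling is monotonic per feature, a decision tree fitted on raw features and one fitted on scaled features make the same predictions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# Monotonic scaling preserves the sort order of every feature, so the splits
# (and therefore the predictions) stay the same.
print(np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled)))   # True
```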

Q: Do SVM and LR need to be normalized?
A: For some models, scaling the dimensions unevenly changes the optimal solution so that it is no longer equivalent to the original one (SVM, for example), so normalization is needed. For other models the solution is equivalent under scaling, for example LR, which in principle does not need normalization; but in practice the model parameters are solved iteratively, and if the objective function is too flat (imagine a very flat Gaussian), the iterative algorithm converges very slowly or not at all, so it is still best to normalize the data.
Supplement: the essential cause is the difference in loss functions. SVM uses Euclidean distance, so if one feature is very large it dominates the other dimensions, while LR can keep its loss function unchanged by adjusting the weights.
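A small sketch (assuming scikit-learn and a public dataset whose features have very different scales) of the practical advice above: wrap the scaler and the model in a pipeline and compare cross-validated scores with and without it. Scaling typically helps the SVM noticeably here, while LR is affected mainly through how well its solver converges.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

for name, model in [("SVM", SVC()), ("LR", LogisticRegression(max_iter=5000))]:
    raw = cross_val_score(model, X, y, cv=5).mean()
    scaled = cross_val_score(make_pipeline(StandardScaler(), model), X, y, cv=5).mean()
    print(f"{name}: raw accuracy {raw:.3f}  vs  scaled accuracy {scaled:.3f}")
```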


