When do machine learning models need data standardization?


Author: louwill

Machine Learning Lab

     

Readers often have the same question when building machine learning models: do I need to standardize my data?

The author has also thought about this question, but not very systematically and from a fairly narrow point of view, so the usual answer was simply that the features differ too much in magnitude or units. To treat the topic properly, the author has consulted the relevant material and elaborates on the question in detail below.

What is data standardization

In a complete machine learning workflow, data standardization has always been an important step. We usually place it in the preprocessing stage and treat it as a general-purpose technique. Yet much of the time we do not know why the data needs to be standardized, or whether standardization will actually improve model performance.

The straightforward definition of data standardization (the z-score transform) is:

x' = (x - μ) / σ

That is, for each feature of the data set, subtract the feature's mean μ and divide by its standard deviation σ. After standardization, each feature has a mean of 0 and a variance of 1, so all features of the data set share the same range of variation.

The most direct application scenario for standardization is this: when the value ranges of the data set's features differ greatly, or when the features are measured in very different units, the data should be standardized as part of preprocessing.

For example, suppose a data set contains two features, one ranging from 5,000 to 10,000 and the other only from 0.1 to 1. During modeling and training the first feature will have a far larger influence on the model's results than the second, and such a model will struggle to make accurate predictions.
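A minimal sketch of this scenario, using NumPy and scikit-learn on synthetic data (the data and values are illustrative, not from the original article):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on wildly different scales, as in the example above
X = np.column_stack([
    rng.uniform(5000, 10000, size=1000),  # large-scale feature
    rng.uniform(0.1, 1.0, size=1000),     # small-scale feature
])

# z-score standardization: (x - mean) / std, applied column by column
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print("means before:", X.mean(axis=0), "stds before:", X.std(axis=0))
print("means after: ", X_std.mean(axis=0).round(6), "stds after:", X_std.std(axis=0).round(6))
```

After the transform both columns have mean 0 and standard deviation 1, so neither dominates simply because of its units.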

Difference from data normalization

Data normalization is another data preprocessing technique. For a long time, however, standardization and normalization have not been clearly distinguished, and the two terms have often been used interchangeably. The usual formula for data normalization (the min-max transform) is:

x' = (x - x_min) / (x_max - x_min)

or a variant of it that rescales the values into a different range (for example [-1, 1]).

Having consulted the relevant material, the author found that there is no universally agreed definition of these two transforms, and the concepts of standardization and normalization are frequently mixed: sometimes the z-score transform is called normalization, and sometimes the min-max transform is called standardization. On balance, the author takes standardization to mean the z-score transform, the first formula above, and normalization to mean the min-max transform given just above.
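As a small illustration (NumPy only, with a made-up vector) of how the two transforms differ:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Standardization (z-score): mean 0, standard deviation 1
x_std = (x - x.mean()) / x.std()

# Normalization (min-max): rescaled into [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_std)   # roughly [-1.41 -0.71  0.    0.71  1.41]
print(x_norm)  # [0.   0.25 0.5  0.75 1.  ]
```

The standardized values are centered around 0, while the normalized values fall inside [0, 1].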

Data standardization is done so that different features become comparable with one another; the shape of a feature's distribution is not changed by the transform. The purpose of data normalization, on the other hand, is to make each feature's influence on the target variable consistent: the feature values are compressed into a fixed range, so normalization changes the distribution of the feature data.

Which models are more sensitive to standardization?

Some machine learning models make predictions or classifications based on distance metrics. Because distances are very sensitive to differences in the value ranges of the features, data standardization is necessary for distance-based models.

The most typical distance-based models are k-nearest neighbors, k-means clustering, the perceptron, and SVMs. Models of the linear regression family also generally require data standardization. By contrast, logistic regression, decision trees, and decision-tree-based ensemble methods such as Boosting and Bagging are not sensitive to the scale of feature values, so such models generally do not need standardized data. Data consisting mostly of categorical variables likewise does not need to be standardized.
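A brief sketch of this practical rule, using scikit-learn (the wine data set and the specific models are chosen here purely for illustration): a scale-sensitive model such as k-nearest neighbors is wrapped together with a StandardScaler in a Pipeline, while a decision tree is used on the raw features.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)  # features with very different scales

models = {
    "kNN (raw features)":  KNeighborsClassifier(),
    "kNN (standardized)":  make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "decision tree (raw)": DecisionTreeClassifier(random_state=0),
}

for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```

On data like this, standardization typically improves the distance-based kNN noticeably, while the tree model is largely unaffected by whether the features are scaled.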

Conclusion

In conclusion: when the value ranges or units of a data set's features differ greatly, it is best to standardize the data. Models such as k-nearest neighbors, k-means clustering, the perceptron, SVMs, and linear regression generally require data standardization. Finally, it is worth keeping the distinction between data standardization and data normalization clear.

References:

https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff

