06-Data normalization (Feature Scaling)

  When we used the kNN algorithm earlier to perform classification, we actually skipped a very important step: data normalization (feature scaling).

  First, why do we need data normalization? Take the tumor example again. Suppose we have two features: the size of the tumor (in cm) and the time since it was discovered (in days). In sample 1 the tumor is 1 cm and was discovered 200 days ago; in sample 2 the tumor is 5 cm and was discovered 100 days ago.

  So what is the distance between these two samples? Using the Euclidean distance, it is sqrt( (1-5)^2 + (200-100)^2 ) ≈ 100.08. It is easy to see that this distance is dominated by the discovery time: the discovery times differ by 100 days while the tumor sizes differ by only 4 cm, so the distance between the samples is determined almost entirely by the discovery time.

  However, if we instead express the discovery time in years (200 days ≈ 0.55 years, 100 days ≈ 0.27 years), the distance between the samples is dominated by the tumor size instead: sqrt( (1-5)^2 + (0.55-0.27)^2 ) ≈ 4.01.
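To make this concrete, here is a quick check with NumPy (a minimal sketch for illustration, not code from the original post):

```python
import numpy as np

# Sample 1: 1 cm tumor, discovered 200 days ago
# Sample 2: 5 cm tumor, discovered 100 days ago
a = np.array([1.0, 200.0])
b = np.array([5.0, 100.0])
print(np.linalg.norm(a - b))  # ~100.08 -> dominated by discovery time

# The same samples with discovery time expressed in years
a_years = np.array([1.0, 200.0 / 365.0])
b_years = np.array([5.0, 100.0 / 365.0])
print(np.linalg.norm(a_years - b_years))  # ~4.01 -> dominated by tumor size
```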
  So if we compute distances between samples directly on the raw features, without any basic preprocessing, the result is very likely to be biased and will not properly reflect the importance of each feature. This is precisely why we need to normalize the data.

  So-called data normalization maps all of our data onto the same scale. The simplest such mapping squeezes every value into the range 0-1; this is called min-max normalization. It can be expressed by the following formula:
x_scale = (x - x_min) / (x_max - x_min)

  This is a fairly simple method. It is suitable when the distribution has obvious boundaries (for example, exam scores are bounded between 0 and 100). Its drawback is that it is strongly affected by outliers (for example, income has no obvious upper bound).
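To see the outlier sensitivity concretely, consider a hypothetical income feature with one extreme value (the numbers are invented for illustration):

```python
import numpy as np

income = np.array([30.0, 40.0, 50.0, 60.0, 1000.0])  # one extreme outlier
scaled = (income - income.min()) / (income.max() - income.min())
print(scaled)  # [0.     0.0103 0.0206 0.0309 1.    ]
# The ordinary values are squashed into a tiny range near 0 by the outlier.
```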

  A corresponding improvement is mean-variance normalization (standardization), which maps all data to a distribution with mean 0 and variance 1. In other words, the results are no longer guaranteed to lie between 0 and 1; instead, the mean of the data becomes 0 and its variance becomes 1. This method is suitable for data whose distribution has no obvious boundaries, i.e. data that may contain extreme values (outliers). In fact, mean-variance normalization is the generally recommended choice; only when a feature's distribution has clear boundaries, such as the exam scores (0-100) or image pixels (0-255) in the earlier examples, do we prefer min-max normalization. The calculation is as follows (x_mean is the mean and S is the standard deviation):
x_scale = (x - x_mean) / S


Implementation

Below, we will implement these two normalization processes.

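A minimal NumPy sketch of min-max normalization (an illustrative version; the function name min_max_scale is our own, not necessarily the notebook's):

```python
import numpy as np

def min_max_scale(X):
    """Map each column of X into [0, 1] with min-max normalization."""
    X = np.asarray(X, dtype=float)
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Example: 50 points with two features
X = np.random.randint(0, 100, (50, 2)).astype(float)
X_scaled = min_max_scale(X)
print(X_scaled.min(axis=0))  # [0. 0.]
print(X_scaled.max(axis=0))  # [1. 1.]
```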
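And a matching sketch of mean-variance normalization (again illustrative; standard_scale is our own name):

```python
import numpy as np

def standard_scale(X):
    """Map each column of X to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.random.randint(0, 100, (50, 2)).astype(float)
X_std = standard_scale(X)
print(X_std.mean(axis=0))  # ~[0. 0.] (up to floating-point error)
print(X_std.std(axis=0))   # [1. 1.]
```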

Using mean-variance normalization, in essence, places the mean (center) of all our data at 0 and the spread (standard deviation) of its distribution at 1. Even if the data contains extreme values (outliers), the data as a whole still has mean 0 and variance 1, so the result is not skewed by a few extreme points; this is its advantage over min-max normalization.
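In practice, scikit-learn's StandardScaler performs this same mean-variance normalization (standard scikit-learn usage, shown here as a supplement to the original post):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.randint(0, 100, (50, 2)).astype(float)
scaler = StandardScaler().fit(X_train)  # learns per-column mean_ and scale_
X_train_std = scaler.transform(X_train)
# For kNN, fit the scaler on the training set only, then apply the same
# transform to the test set so both live on the same scale.
```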


For the specific code, see 07 Data normalization.ipynb

Source: blog.csdn.net/qq_41033011/article/details/108973224