07-Scaler in scikit-learn

Scaler in scikit-learn

  In the previous blog, I introduced two ways to normalize data: maximum-value (min-max) normalization and mean-variance normalization (standardization). However, there is a very important consideration when applying normalization in our machine learning workflow.

  For the original data set, we split it into a training set and a test set. If we want to train the model on normalized data, we obviously need to normalize the training set first. For example, with mean-variance normalization we compute the mean of the training set (mean_train) and the standard deviation of the training set (std_train), normalize the training data with them, and then train the model on the result. Finally, we use the trained model to make predictions, and the test set must be normalized in the same way before it is fed to the model. The question is: how should we normalize the test set? Most people would simply compute the mean (mean_test) and standard deviation (std_test) of the whole test set and use those. There is a trap here that we must pay attention to: doing so is wrong.
  The correct approach is to normalize the test set with the mean_train and std_train obtained from the training set, that is, (x_test - mean_train) / std_train. Why should we do this?
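
As a minimal sketch of this rule (the arrays below are made-up toy values, assuming NumPy), the statistics are computed on the training set only and then reused on the test set:

import numpy as np

# toy stand-ins for a real train/test split (hypothetical values)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[2.5, 15.0], [3.5, 35.0]])

# statistics come from the training set only
mean_train = X_train.mean(axis=0)
std_train = X_train.std(axis=0)

# both sets are normalized with the training-set statistics
X_train_std = (X_train - mean_train) / std_train
X_test_std = (X_test - mean_train) / std_train  # NOT (X_test - X_test.mean(axis=0)) / X_test.std(axis=0)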

  There are several reasons. First, and most importantly, the test data simulates the real environment. We split the test set off from the original data, and it is indeed easy to compute its mean and standard deviation, but do not forget that we train the model so that it can be used in a real environment, and in many cases the real environment simply cannot provide the mean and standard deviation of all future data. Take iris recognition as a simple example: although we can compute the average feature values over all the iris flowers in the test set, in actual use only one flower arrives at a time. What would the mean of a single flower be? What would its standard deviation be? Such statistics cannot be obtained. Therefore, when a new iris flower arrives in practice, the only thing we can do is normalize its features by subtracting the mean_train of the training set and dividing by the standard deviation of the training set.

  Another point: normalizing the data is itself part of the algorithm. In other words, the algorithm includes the step of subtracting mean_train from the data and then dividing by std_train. All subsequent data should be processed in the same way before we test the accuracy, and the accuracy obtained this way is the true accuracy of the algorithm.

  Once we understand this, normalizing the test set requires us to save the mean and standard deviation obtained from the training set. To make this step convenient, sklearn encapsulates a dedicated class, Scaler, for data normalization, and designs it so that its usage is consistent with the overall workflow of sklearn's machine learning algorithms. The following figure shows the Scaler workflow.

[Figure: the Scaler workflow, fit on the training set to compute its statistics, then transform subsequent samples]
  The whole process is to pass the training set X_train (and optionally y_train, which is ignored for scaling) into the Scaler, which, just like an estimator, has a fit method. fit computes the statistical quantities of the training set, such as the mean and standard deviation, and this key information is saved inside the Scaler. When other samples come in later, the Scaler can transform them to obtain the corresponding output. Compared with the flow of a machine learning algorithm, the only change is that predict becomes transform.

Below, let's look at sklearn's encapsulation of data normalization with actual code.

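The original post showed this as screenshots; here is a minimal sketch of the same kind of experiment (assuming the iris data set and a kNN classifier as in the earlier posts; the split parameters and n_neighbors are illustrative, not taken from the original):

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

# fit the scaler on the training set only
standardScaler = StandardScaler()
standardScaler.fit(X_train)
print(standardScaler.mean_)   # per-feature means of the training set
print(standardScaler.scale_)  # per-feature standard deviations of the training set

X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train_standard, y_train)

# correct: the test set is transformed with the training-set statistics
print(knn_clf.score(X_test_standard, y_test))

# wrong: the raw test set is fed to a model trained on standardized features
print(knn_clf.score(X_test, y_test))

The last call, scoring the un-normalized test set against a model trained on standardized features, is what produces the drastically lower accuracy discussed below.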

From the above test, it can be seen that if we normalize only the training set and feed the un-normalized test set to the model, the accuracy obtained is only 0.33. So we must make sure that both data sets are normalized, and in the same way.


Packaging a StandardScaler

After seeing how Scaler is used in sklearn, let's package a StandardScaler ourselves.

# preprocessing.py

import numpy as np

class StandardScaler:
    def __init__(self):
        self.mean_ = None
        self.scale_ = None
    def fit(self, X):
        """根据训练数据集X获得数据的均值和方差,只处理二维数据"""
        assert X.ndim == 2, "The dimension of X must be 2"
        self.mean_ = np.array([np.mean(X[:, i]) for i in range(X.shape[1])])
        self.scale_ = np.array([np.std(X[:, i]) for i in range(X.shape[1])])

        return self

    def transform(self, X):
        """将X根据这个StandardScaler进行均值方差归一化处理"""
        assert X.ndim == 2, "The dimension of X must be 2"
        """而且fit必须在transform之前执行,所以mean_和scale_必须是非空的"""
        assert self.mean_ is not None and self.scale_ is not None, \
            "must fit before transform!"
        assert X.shape[1] == len(self.mean_), \
            "the feature nunmber of X must be equal to mean_ and std_"
        resX = np.empty(shape=X.shape, dtype=float)
        for col in range(X.shape[1]):
            resX[:, col] = (X[:, col] - self.mean_[col]) / self.scale_[col]
        return resX

test:
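
The original test screenshots are not reproduced here; a minimal usage sketch of the class above (again assuming the iris data, with an illustrative split) would look like this:

from sklearn import datasets
from sklearn.model_selection import train_test_split

from preprocessing import StandardScaler  # the class defined above

iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

my_scaler = StandardScaler()
my_scaler.fit(X_train)
print(my_scaler.mean_)   # per-feature means of the training set
print(my_scaler.scale_)  # per-feature standard deviations of the training set

X_train_standard = my_scaler.transform(X_train)
X_test_standard = my_scaler.transform(X_test)
# the results should match sklearn's StandardScaler up to floating-point error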



For the specific code, see Scaler.ipynb and the StandardScaler.ipynb test notebook in package 08.


Origin blog.csdn.net/qq_41033011/article/details/108975425