Sklearn dataset splitting and data preprocessing

1. Dataset splitting

In machine learning, a dataset is generally split into two parts:

  • Training data: used to train and build the model
  • Test data: used to test the model and evaluate whether it works well

Common split ratios:

  • Training set: 70%, 75%, or 80%
  • Test set: 30%, 25%, or 20%

Dataset splitting API

  • sklearn.model_selection.train_test_split(arrays, *options)
    • Parameters:
      • x: the feature values of the dataset
      • y: the label (target) values of the dataset
      • test_size: the size of the test set, usually a float (defaults to 0.25)
      • random_state: random seed; different seeds produce different random splits, while the same seed always reproduces the same split
    • Returns:
      • x_train, x_test, y_train, y_test
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# 1. Load the iris dataset
iris = load_iris()
# Split the iris dataset into training and test sets
# x_train/x_test: training/test feature values; y_train/y_test: training/test target values
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)
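
As a quick sanity check (an added illustrative snippet, continuing the variables above): since test_size is not specified, train_test_split falls back to its default of 0.25, so the 150 iris samples are split into 112 training and 38 test samples.

# With the default test_size=0.25, the 150 iris samples split into 112 train / 38 test
print(x_train.shape, x_test.shape)   # (112, 4) (38, 4)
print(y_train.shape, y_test.shape)   # (112,) (38,)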

 

2. Data normalization

It can be seen that the four iris features are not distributed over the same value range, which is unfavorable for training a machine-learning model, so the feature data needs to be normalized.

  • Why do we need to normalize?

    • When features differ greatly in unit or scale, or one feature's variance is several orders of magnitude larger than the others', that feature easily affects (dominates) the target result and prevents some algorithms from learning from the other features.

Definition

Normalization transforms the original data and maps it into a given interval (by default [0, 1]).

Formula

$$X' = \frac{x - \min}{\max - \min}, \qquad X'' = X' \times (mx - mi) + mi$$

where min and max are the minimum and maximum of the feature column, and (mi, mx) are the bounds of the target feature_range.
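
As a small worked example of this formula (illustrative numbers, not from the original post): for a single feature column [1, 3, 5] with the default feature_range=(0, 1), min = 1 and max = 5, so the values map to [0, 0.5, 1]. The hand computation matches MinMaxScaler:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hand-compute (x - min) / (max - min) for the column [1, 3, 5]
col = np.array([[1.0], [3.0], [5.0]])
manual = (col - col.min()) / (col.max() - col.min())
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(col)
print(manual.ravel())   # [0.  0.5 1. ]
print(scaled.ravel())   # [0.  0.5 1. ], matching the manual computation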

API:

sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1))

  • MinMaxScaler.fit_transform(X)
    • X: data in numpy array format [n_samples, n_features]
  • Return value: an array of the same shape as the input, after transformation
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

iris_data = pd.DataFrame(iris["data"], columns=['sepal length',
                                                'sepal width',
                                                'petal length',
                                                'petal width'])


def min_max_demo(data):
    """
    Demonstrate data normalization (min-max scaling)
    """
    transfer = MinMaxScaler(feature_range=(0, 1))
    res_data = transfer.fit_transform(data[['sepal length',
                                            'sepal width',
                                            'petal length',
                                            'petal width']])
    print(res_data)


min_max_demo(iris_data)
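
The feature_range parameter can also map features into an interval other than [0, 1]; a minimal sketch (an added example, reusing iris_data from above):

# Map every feature into [2, 3] instead of the default [0, 1]
transfer = MinMaxScaler(feature_range=(2, 3))
print(transfer.fit_transform(iris_data)[:3])   # every value now lies in [2, 3]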

 

3. Standardization

Question: if there are many abnormal points (outliers) in the data, what happens?

Note: the maximum and minimum values vary with the data and are easily distorted by outliers, so min-max normalization is not very robust and is only suitable for traditional, small-scale scenarios with clean data.

Definition:

By transforming the original data, each feature is rescaled to have mean 0 and standard deviation 1.

Formula:

$$X' = \frac{x - \bar{x}}{\sigma}$$

where $\bar{x}$ is the mean of the feature column and $\sigma$ is its standard deviation.

  • For normalization: if outliers affect the maximum or minimum value, the result obviously changes.
  • For standardization: given a reasonable amount of data, a few outliers have little effect on the mean, and likewise little effect on the standard deviation, as the sketch below shows.
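
A small sketch of this difference (illustrative numbers, not from the original post): appending a single outlier to 1,000 well-behaved samples changes the maximum drastically, while the mean and standard deviation barely move.

import numpy as np

# 1000 samples around 5, then one outlier at 20 appended
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.0, size=1000)
with_outlier = np.append(data, 20.0)

# The maximum (which anchors min-max normalization) jumps:
print(data.max(), with_outlier.max())      # roughly 8 vs exactly 20
# ...while the mean and standard deviation (which anchor standardization) barely change:
print(data.mean(), with_outlier.mean())    # both roughly 5.0
print(data.std(), with_outlier.std())      # roughly 1.0 vs roughly 1.1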

API:

sklearn.preprocessing.StandardScaler()

  • After processing, the data in each column is centered around mean 0 with standard deviation 1.
  • StandardScaler.fit_transform(X)
    • X: data in numpy array format [n_samples, n_features]
  • Return value: an array of the same shape as the input, after transformation
from sklearn.preprocessing import StandardScaler

iris_data = pd.DataFrame(iris["data"], columns=['sepal length',
                                                'sepal width',
                                                'petal length',
                                                'petal width'])


def stand_demo(data):
    """
    Demonstrate data standardization
    """
    transfer = StandardScaler()
    res_data = transfer.fit_transform(data[['sepal length',
                                            'sepal width',
                                            'petal length',
                                            'petal width']])
    print(res_data)
    print('Mean of each column:\n', transfer.mean_)
    print('Variance of each column:\n', transfer.var_)


stand_demo(iris_data)
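
In practice, splitting and scaling are combined: the scaler is fit on the training set only and then applied unchanged to the test set, so no information from the test data leaks into training. A minimal sketch reusing x_train and x_test from section 1 (standard sklearn usage, added for illustration):

# Learn mean/variance from the training features only, then reuse them on the test set
transfer = StandardScaler()
x_train_scaled = transfer.fit_transform(x_train)
x_test_scaled = transfer.transform(x_test)
print(x_train_scaled.mean(axis=0))   # each column's mean is (numerically) 0
print(x_test_scaled.mean(axis=0))    # close to 0, but not exactly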

Origin blog.csdn.net/qq_39197555/article/details/115202257