Machine Learning Complete Tutorial (4)--Feature Preprocessing


2.4 Feature Preprocessing

Learning objectives

  • Objectives
    • Understand the characteristics of numerical and categorical data
    • Use MinMaxScaler to normalize feature data
    • Use StandardScaler to standardize feature data
  • Application
    • None

What is feature preprocessing?

[Figure: feature preprocessing overview]

2.4.1 What is feature preprocessing

# scikit-learn's explanation
provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

In other words: feature preprocessing is the process of converting raw feature data, through conversion functions, into feature data better suited to the algorithm model.

This can be understood from the figure above.

1 What it includes

  • Dimensionless scaling of numerical data:
    • Normalization
    • Standardization

2 Feature Preprocessing API

sklearn.preprocessing

Why do we normalize/standardize?

  • The units or magnitudes of features differ greatly, or the variance of one feature is several orders of magnitude larger than that of the others. Such a feature easily dominates the target result and prevents some algorithms from learning from the other features.

[Figure: the dating dataset]

We therefore need dimensionless transformations that convert data of different scales to a common scale.

2.4.2 Normalization

1 Definition

Transform the original data so that it is mapped into a fixed interval (by default [0, 1])

2 Formula

X' = (x - min) / (max - min)
X'' = X' * (mx - mi) + mi

This acts on each column: max is the maximum value of the column, min is its minimum value, and X'' is the final result. mx and mi are the bounds of the target interval; by default mx is 1 and mi is 0.

So how do we understand this process? Let's walk through an example.

[Figure: normalization calculation example]
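The calculation in the figure can be sketched in plain Python. The column of values below is made up for illustration:

```python
# Min-max normalization computed by hand for one feature column.
# Formula: X' = (x - min) / (max - min), then X'' = X' * (mx - mi) + mi.
values = [90, 60, 75, 45]  # hypothetical raw feature values

mn, mx_val = min(values), max(values)
mi, mx = 0, 1  # target interval [mi, mx], the default [0, 1]

scaled = [(x - mn) / (mx_val - mn) * (mx - mi) + mi for x in values]
print(scaled)  # the smallest value maps to 0, the largest to 1
```

Changing `mi` and `mx` reproduces the `feature_range=(2, 3)` behaviour used in the demo further below.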

3 API

  • sklearn.preprocessing.MinMaxScaler (feature_range=(0,1)… )
    • MinMaxScaler.fit_transform(X)
      • X: data in numpy array format [n_samples, n_features]
    • Return value: a transformed array of the same shape
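Before the dating-data demo below, here is a minimal, self-contained sketch of the API on a small array (the values are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Data in [n_samples, n_features] format, as the API expects.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# 1. Instantiate the transformer; feature_range defaults to (0, 1).
transfer = MinMaxScaler(feature_range=(0, 1))
# 2. Call fit_transform to learn each column's min/max and rescale.
result = transfer.fit_transform(X)
print(result)  # each column is mapped independently onto [0, 1]
```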

4 Data calculation

We will work with the following data, stored in dating.txt; it is the dating-profile data seen earlier.

milage,Liters,Consumtime,target
40920,8.326976,0.953952,3
14488,7.153469,1.673904,2
26052,1.441871,0.805124,1
75136,13.147394,0.428964,1
38344,1.669788,0.134296,1
  • Analysis

1. Instantiate MinMaxScaler

2. Call fit_transform to convert the data

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def minmax_demo():
    """
    Normalization demo
    :return: None
    """
    data = pd.read_csv("dating.txt")
    print(data)
    # 1. Instantiate a transformer class
    transfer = MinMaxScaler(feature_range=(2, 3))
    # 2. Call fit_transform
    data = transfer.fit_transform(data[['milage','Liters','Consumtime']])
    print("Result of min-max normalization:\n", data)

    return None

Return result:

     milage     Liters  Consumtime  target
0     40920   8.326976    0.953952       3
1     14488   7.153469    1.673904       2
2     26052   1.441871    0.805124       1
3     75136  13.147394    0.428964       1
..      ...        ...         ...     ...
998   48111   9.134528    0.728045       3
999   43757   7.882601    1.332446       3

[1000 rows x 4 columns]
Result of min-max normalization:
 [[ 2.44832535  2.39805139  2.56233353]
 [ 2.15873259  2.34195467  2.98724416]
 [ 2.28542943  2.06892523  2.47449629]
 ...,
 [ 2.29115949  2.50910294  2.51079493]
 [ 2.52711097  2.43665451  2.4290048 ]
 [ 2.47940793  2.3768091   2.78571804]]

Question: If there are many outliers in the data, what will be the impact?

[Figure: the effect of outliers on normalization]

5 Normalized Summary

Note that the maximum and minimum values vary with the data, and they are easily affected by outliers; this method is therefore not robust and suits only traditional, precise, small-scale data scenarios.

What can we do instead?

2.4.3 Standardization

1 Definition

Transform the original data so that each column has a mean of 0 and a standard deviation of 1

2 Formula

X' = (x - mean) / σ

This acts on each column: mean is the mean of the column and σ is its standard deviation.
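The formula can be checked by hand against StandardScaler on a tiny column (the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature column; standardization: X' = (x - mean) / σ.
col = np.array([1.0, 2.0, 3.0, 4.0])

mean = col.mean()
sigma = col.std()  # StandardScaler uses the population std (ddof=0)
by_hand = (col - mean) / sigma

by_sklearn = StandardScaler().fit_transform(col.reshape(-1, 1)).ravel()
print(by_hand)
print(np.allclose(by_hand, by_sklearn))  # True
```

After the transform, the column itself has mean 0 and standard deviation 1, which is exactly what the definition above states.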

Returning to the outliers discussed above, let's see how standardization behaves.

[Figure: the effect of outliers on standardization]

  • For normalization: outliers that affect the maximum or minimum obviously change the result.
  • For standardization: given a reasonable amount of data, a few outliers have little effect on the mean, so the mean and variance change little.
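The contrast can be demonstrated by adding a single moderate outlier to an otherwise ordinary column (data made up for illustration) and measuring how far the largest normal point moves under each transform:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# 1000 ordinary points plus one moderate outlier.
clean = np.arange(1000, dtype=float).reshape(-1, 1)
dirty = np.vstack([clean, [[5000.0]]])

def shift(scaler_cls):
    """Relative shift of the largest *normal* point once the outlier is added."""
    a = scaler_cls().fit_transform(clean)[-1, 0]   # without the outlier
    b = scaler_cls().fit_transform(dirty)[999, 0]  # same point, with it
    return abs(a - b) / abs(a)

print("min-max relative shift:      ", shift(MinMaxScaler))   # large
print("standardized relative shift: ", shift(StandardScaler)) # much smaller
```

The outlier becomes the new maximum for min-max scaling, squashing all normal points toward 0, while the mean and standard deviation used by standardization move only modestly.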

3 API

  • sklearn.preprocessing.StandardScaler( )
    • After processing, each column of the data has a mean of 0 and a standard deviation of 1
    • StandardScaler.fit_transform(X)
      • X: data in numpy array format [n_samples, n_features]
    • Return value: a transformed array of the same shape

4 Data calculation

Apply the same processing to the data above.

  • Analysis

1. Instantiate StandardScaler

2. Call fit_transform to convert the data

import pandas as pd
from sklearn.preprocessing import StandardScaler

def stand_demo():
    """
    Standardization demo
    :return: None
    """
    data = pd.read_csv("dating.txt")
    print(data)
    # 1. Instantiate a transformer class
    transfer = StandardScaler()
    # 2. Call fit_transform
    data = transfer.fit_transform(data[['milage','Liters','Consumtime']])
    print("Standardized result:\n", data)
    print("Mean of each feature column:\n", transfer.mean_)
    print("Variance of each feature column:\n", transfer.var_)

    return None

Return result:

     milage     Liters  Consumtime  target
0     40920   8.326976    0.953952       3
1     14488   7.153469    1.673904       2
2     26052   1.441871    0.805124       1
..      ...        ...         ...     ...
997   26575  10.650102    0.866627       3
998   48111   9.134528    0.728045       3
999   43757   7.882601    1.332446       3

[1000 rows x 4 columns]
Standardized result:
 [[ 0.33193158  0.41660188  0.24523407]
 [-0.87247784  0.13992897  1.69385734]
 [-0.34554872 -1.20667094 -0.05422437]
 ...,
 [-0.32171752  0.96431572  0.06952649]
 [ 0.65959911  0.60699509 -0.20931587]
 [ 0.46120328  0.31183342  1.00680598]]
Mean of each feature column:
 [  3.36354210e+04   6.55996083e+00   8.32072997e-01]
Variance of each feature column:
 [  4.81628039e+08   1.79902874e+01   2.46999554e-01]

5 Standardization Summary

Standardization is relatively stable when there are enough samples, so it is suitable for modern, noisy, large-scale data scenarios.
