Machine Learning 02: Feature Preprocessing

What is feature preprocessing

Feature preprocessing converts raw data into the form required by an algorithm, using specific statistical (mathematical) methods.
For numerical data, the usual methods are scaling and cleaning:
1. Normalization
2. Standardization
3. Missing value handling
For categorical data: one-hot encoding.
For time data: time segmentation.
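
One-hot encoding and time segmentation are not covered further in this post, but as a minimal sketch (the color values here are made up purely for illustration), sklearn's OneHotEncoder turns each distinct category into its own 0/1 column:

from sklearn.preprocessing import OneHotEncoder

# Each distinct category becomes its own 0/1 column
enc = OneHotEncoder()
data = enc.fit_transform([["red"], ["green"], ["blue"], ["green"]])
print(enc.categories_)  # [array(['blue', 'green', 'red'], dtype=object)]
print(data.toarray())
"""
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]]
"""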

Normalization

Characteristics: the original data is transformed so that it is mapped into a specified interval (the default is [0, 1]).
Formula:

$$X' = \frac{x - \min}{\max - \min}, \qquad X'' = X' \times (mx - mi) + mi$$

Note: this acts on each column. max is the maximum of that column, min is its minimum, X'' is the final result, and mx and mi are the bounds of the target interval; by default mx = 1 and mi = 0.
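
As a quick worked example of the formula (with made-up numbers), take a column containing the values 60, 75 and 90 and map it to the default interval [0, 1]. For the middle value x = 75:

$$X' = \frac{75 - 60}{90 - 60} = 0.5, \qquad X'' = 0.5 \times (1 - 0) + 0 = 0.5$$

so 60 maps to 0, 90 maps to 1, and 75 lands exactly in the middle.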

So why do we need to normalize the data? Let's look at an example.
This is a set of dating data.
[Table: dating samples with three feature columns and a class label]
Each sample describes a man using three features: the percentage of time spent playing video games, the number of frequent flyer miles earned per year, and the liters of ice cream consumed per week. Each sample also carries a label, the rating given by the woman: didn't like, somewhat attractive, or very attractive. Looking at the raw numbers, it may seem that flight mileage dominates the dating outcome, but the people who collected the data consider all three features equally important.
When the features are equally important, the data needs to be normalized!
Without normalization, distance-based classification goes wrong, for example in the k-nearest neighbor algorithm
(it doesn't matter if you are not familiar with k-nearest neighbors yet).
Take the two red-marked samples in the figure. In the k-nearest neighbor algorithm, the "distance" between them is

$$\sqrt{(72993 - 35948)^2 + (10.14 - 6.8)^2 + (1.0 - 1.21)^2}$$

that is, the difference of each feature is squared, the squares are summed, and the square root is taken.
Clearly the last two terms contribute almost nothing to the total, so the features are not treated as equally important.
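
To make this concrete, here is a minimal sketch using only the two samples quoted above (in practice you would fit the scaler on the whole dataset, so the numbers are purely illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The two samples discussed above:
# [frequent flyer miles, % of time playing games, liters of ice cream]
a = np.array([72993.0, 10.14, 1.0])
b = np.array([35948.0, 6.8, 1.21])

# Distance on the raw features: the mileage difference dominates everything
print(np.sqrt(np.sum((a - b) ** 2)))  # ~37045.0

# After min-max scaling, every feature lies in [0, 1],
# so all three contribute on the same scale
scaled = MinMaxScaler().fit_transform(np.vstack([a, b]))
print(np.sqrt(np.sum((scaled[0] - scaled[1]) ** 2)))  # ~1.73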


The full normalization example is as follows:

from sklearn.preprocessing import MinMaxScaler


def mm():
    """
    Normalization (min-max scaling)
    """
    mm = MinMaxScaler()
    # MinMaxScaler defaults to the range [0, 1]; to change it, pass feature_range,
    # e.g. MinMaxScaler(feature_range=(2, 3))
    data = mm.fit_transform(
        [[90, 2, 10, 40],
         [60, 4, 15, 45],
         [75, 3, 13, 46]]
    )
    print(data)
    """
    [[1.         0.         0.         0.        ]
     [0.         1.         1.         0.83333333]
     [0.5        0.5        0.6        1.        ]]
    """


if __name__ == '__main__':
    mm()

However, the maximum and minimum used by this method are very susceptible to outliers,
so it is not very robust and is only suitable for traditional, small, clean data scenarios.
(Robustness can be understood as how much tolerance the algorithm has to changes in the data.)
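
A minimal sketch of this weakness (with made-up numbers): one extreme value stretches the column's range so much that the ordinary values are all squashed toward 0.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One outlier (1000) in an otherwise small-valued column
col = np.array([[1.0], [2.0], [3.0], [1000.0]])
print(MinMaxScaler().fit_transform(col).ravel())
# approximately [0, 0.001, 0.002, 1]: the normal values become nearly indistinguishable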

Standardization

Characteristics: the original data is transformed so that each feature has mean 0 and variance 1.
Formula:

$$X' = \frac{x - \text{mean}}{\sigma}$$

Note: this acts on each column, i.e. on each feature. mean is the column's average and σ is its standard deviation (the standard deviation can be thought of as measuring how spread out, or how stable, the data is).
The standard deviation σ is the square root of the variance. The variance (where M is the mean and n is the number of data points) is:

$$\sigma^2 = \frac{(x_1 - M)^2 + (x_2 - M)^2 + \cdots + (x_n - M)^2}{n}$$
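
As a quick worked example, take the first column of the data used in the code below, [1., 2., 4.]:

$$M = \frac{1 + 2 + 4}{3} = \frac{7}{3}, \qquad \sigma^2 = \frac{(1 - \tfrac{7}{3})^2 + (2 - \tfrac{7}{3})^2 + (4 - \tfrac{7}{3})^2}{3} = \frac{14}{9}, \qquad \sigma \approx 1.247$$

so the standardized values are approximately −1.07, −0.27 and 1.34, which together have mean 0 and variance 1.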
For standardization: if there are outliers, then as long as there is a reasonable amount of data, a small number of outliers has little effect on the mean, so the variance also changes little.


sklearn standardization API: sklearn.preprocessing.StandardScaler
The code snippet is as follows:

from sklearn.preprocessing import StandardScaler


def stand():
    """
    Standard scaling (standardization)
    """
    std = StandardScaler()
    data = std.fit_transform(
        [[1., -1., 3.],
         [2., 4., 2.],
         [4., 6., -1.]]
    )
    print(data)

(The printed output shows each feature column rescaled to mean 0 and standard deviation 1.)

Standardization is relatively stable when there are enough samples, so it is suitable for modern, noisy, large-scale data scenarios.
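
A minimal sketch of that stability (with randomly generated numbers, so the exact values are only illustrative): adding one mild outlier to a large column barely moves its mean and only modestly changes its standard deviation, but it drastically stretches the min-max range.

import numpy as np

rng = np.random.default_rng(0)
col = rng.normal(loc=10, scale=2, size=10_000)
col_out = np.append(col, 100.0)  # one outlier among 10 001 values

print(col.mean(), col_out.mean())  # the mean barely changes (both close to 10)
print(col.std(), col_out.std())    # the standard deviation changes only modestly
print(col.max(), col_out.max())    # but the maximum jumps to 100, which would
                                   # stretch the min-max range dramatically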

Missing value handling

1. Delete directly
If the proportion of missing values in a column or row exceeds a certain threshold, it is reasonable to discard the whole row or column (see the short sketch after this list).
In general, simply deleting is not recommended, because it reduces the amount of data.
2. Imputation
Replace the missing values with the column's mean, median, etc.
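
As a minimal sketch of option 1, rows containing NaN can be dropped with plain NumPy (using the same small array as the imputation example below):

import numpy as np

data = np.array([[1.0, 2.0],
                 [np.nan, 3.0],
                 [7.0, 6.0]])

# Keep only the rows that contain no NaN values
rows_kept = data[~np.isnan(data).any(axis=1)]
print(rows_kept)
"""
[[1. 2.]
 [7. 6.]]
"""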
sklearn missing value API: from sklearn.impute import SimpleImputer
The code is as follows:

import numpy as np
from sklearn.impute import SimpleImputer


def im():
    """
    Missing value handling (imputation)
    """
    # strategy: the fill strategy, e.g. the mean ("mean") or the median ("median")
    # missing_values defaults to np.nan
    # if the missing values are not np.nan, replace them with np.nan first
    im = SimpleImputer(missing_values=np.nan, strategy="mean")
    data = im.fit_transform(
        [[1, 2],
         [np.nan, 3],
         [7, 6]]
    )
    print(data)

(The printed output shows the missing value replaced by 4, the mean of its column.)
