[Sklearn] Data preprocessing + feature engineering

Caicai's machine learning tutorial notes

Data mining process:

  1. Get the data
  2. Data preprocessing
    Data preprocessing is the process of detecting, correcting or removing damaged, inaccurate or unsuitable records from the data.
    Problems we may face: the data comes in different types, for example some fields are text, some are numbers, some contain time series, some are continuous and some are discrete.
    The quality of the data may also be poor: there is noise, there are outliers, there are missing values, the data is wrong, the dimensions (units) differ, there are duplicates, the data is skewed, or the amount of data is too large or too small.
    The purpose of data preprocessing: to adapt the data to the model and match the needs of the model.
  3. Feature engineering
    Feature engineering is the process of transforming the original data into features that better represent the underlying problem for the predictive model. It can be done by selecting the most relevant features, extracting features, and creating features, where creating features is often implemented with dimensionality reduction algorithms.
    Problems we may face: the features are correlated with each other, the features are unrelated to the label, there are too many or too few features, or the features simply cannot express the data phenomenon or the true appearance of the data.
    The purpose of feature engineering: 1) reduce the computational cost, 2) raise the upper limit of the model.
  4. Build the model, test the model and predict the results
  5. Bring the model online and verify its effect

1. Data preprocessing: Preprocessing & Impute

1.1 Making data dimensionless

In the practice of machine learning algorithms, we often need to convert data of different scales to the same scale, or data from different distributions to a specific distribution. This requirement is collectively referred to as making the data "dimensionless".

Data normalization: preprocessing.MinMaxScaler

When the data (x) is centered by its minimum value and then scaled by the range (maximum value minus minimum value), the data is shifted by the minimum and compressed into the interval [0,1]. This process is called data normalization (Normalization, also known as Min-Max Scaling). Note that normalization here is not regularization: regularization is a different concept and is not a data preprocessing technique. The normalized data falls in [0,1], and the formula is:

x* = (x - min(x)) / (max(x) - min(x))
In sklearn, we use preprocessing.MinMaxScaler to achieve this function. MinMaxScaler has an important parameter, feature_range, which controls the range we want to compress the data to. The default is [0,1].

from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
#Not too familiar with numpy? Can you tell the structure of data?
#What would it look like as a table?
import pandas as pd
pd.DataFrame(data) #view the data as a table

#Perform the normalization
scaler = MinMaxScaler() #instantiate
scaler = scaler.fit(data) #fit: here it essentially computes min(x) and max(x)
result = scaler.transform(data) #export the result through the transform interface
result
result_ = scaler.fit_transform(data) #fit and export the result in one step
scaler.inverse_transform(result) #reverse the normalization

#Use the feature_range parameter of MinMaxScaler to normalize the data to a range other than [0,1]
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler(feature_range=[5,10]) #instantiate again
result = scaler.fit_transform(data) #fit_transform: fit and export the result in one step
result
#When X contains a very large number of features, fit may report that the data is too large to compute
#In that case, use partial_fit as the training interface
#scaler = scaler.partial_fit(data)
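
For reference, a minimal sketch of how partial_fit could be used to fit the scaler batch by batch (the array and the number of batches below are made up for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_big = np.random.rand(100000, 5) #stand-in for a feature matrix too large to fit in one call
scaler = MinMaxScaler()
for batch in np.array_split(X_big, 10): #feed the data in 10 chunks
    scaler.partial_fit(batch) #the running min(x) and max(x) are updated incrementally
result = scaler.transform(X_big) #transform once all batches have been seen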

Data standardization: preprocessing.StandardScaler

When the data (x) is centered by the mean (μ) and then scaled by the standard deviation (σ), the data will have a mean of 0 and a variance of 1. This process is called data standardization (Standardization, also known as Z-score normalization). The formula is:

x* = (x - μ) / σ

from sklearn.preprocessing import StandardScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = StandardScaler() #instantiate
scaler.fit(data) #fit: essentially computes the mean and variance
scaler.mean_ #view the mean through the mean_ attribute
scaler.var_ #view the variance through the var_ attribute
x_std = scaler.transform(data) #export the result through the transform interface
x_std.mean() #the result is an array; check its mean with mean()
x_std.std() #check its standard deviation with std()
scaler.fit_transform(data) #fit and export the result in one step
scaler.inverse_transform(x_std) #reverse the standardization with inverse_transform

Which one to choose

It depends on the situation. In most machine learning algorithms, StandardScaler is chosen for feature scaling, because MinMaxScaler is very sensitive to outliers. In PCA, clustering, logistic regression, support vector machines and neural networks, StandardScaler is usually the better choice.

MinMaxScaler is widely used when distance measures, gradients and covariance calculations are not involved and the data needs to be compressed into a specific interval, for example when quantizing pixel intensities in digital image processing, where MinMaxScaler compresses the data into the interval [0,1].

It is recommended to try StandardScaler first; if the result is not good, switch to MinMaxScaler.
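
As a rough illustration of the outlier sensitivity mentioned above (a toy example of my own, not from the original notes):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data_out = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]]) #one extreme outlier
MinMaxScaler().fit_transform(data_out) #the four ordinary values are squeezed into roughly [0, 0.003]
StandardScaler().fit_transform(data_out) #the outlier also inflates the mean and std, but the output is not pinned to a fixed interval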

1.2 Missing values

impute.SimpleImputer

sklearn.impute.SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True)

Key parameters: missing_values tells the imputer which value in the data marks a missing entry (np.nan by default); strategy is the imputation strategy ('mean' by default; 'median', 'most_frequent' and 'constant' are also available); fill_value is the constant used when strategy='constant'; copy controls whether a copy of the feature matrix is created (True by default).

data.info()
#Fill in Age
Age = data.loc[:,"Age"].values.reshape(-1,1) #the feature matrix in sklearn must be two-dimensional
Age[:20]
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer() #instantiate; the default strategy fills with the mean
imp_median = SimpleImputer(strategy="median") #fill with the median
imp_0 = SimpleImputer(strategy="constant",fill_value=0) #fill with 0
imp_mean = imp_mean.fit_transform(Age) #fit_transform: fit and retrieve the result in one step
imp_median = imp_median.fit_transform(Age)
imp_0 = imp_0.fit_transform(Age)
imp_mean[:20]
imp_median[:20]
imp_0[:20] #here we use the median to fill Age
data.loc[:,"Age"] = imp_median
data.info()
#Fill Embarked with the mode (most frequent value)
Embarked = data.loc[:,"Embarked"].values.reshape(-1,1)
imp_mode = SimpleImputer(strategy = "most_frequent")
data.loc[:,"Embarked"] = imp_mode.fit_transform(Embarked)
data.info()

Filling missing values directly with pandas and numpy is actually simpler:

data.loc[:,"Age"] = data.loc[:,"Age"].fillna(data.loc[:,"Age"].median())
#.fillna 在DataFrame里面直接进行填补
data.dropna(axis=0,inplace=True)
#.dropna(axis=0)删除所有有缺失值的行,.dropna(axis=1)删除所有有缺失值的列
#参数inplace,为True表示在原数据集上进行修改,为False表示生成一个复制对象,不修改原数据,默认False

1.3 Handling categorical features: encoding and dummy variables

Encoding: converting text data into numeric data

preprocessing.LabelEncoder: dedicated to labels; converts categories into numeric category codes

(for a one-dimensional label vector)

from sklearn.preprocessing import LabelEncoder
data.iloc[:,-1] = LabelEncoder().fit_transform(data.iloc[:,-1])
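
Written out step by step (assuming data is the same DataFrame as above, with the label in its last column), the same encoding looks like this; the classes_ attribute shows the categories that were learned:

from sklearn.preprocessing import LabelEncoder

y = data.iloc[:,-1] #labels are allowed to be one-dimensional
le = LabelEncoder() #instantiate
le = le.fit(y) #fit: learn the distinct classes
label = le.transform(y) #transform: map each class to an integer
le.classes_ #view the learned classes
data.iloc[:,-1] = label #write the encoded labels back into the DataFrame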

preprocessing.OrdinalEncoder: dedicated to features; converts categorical features into ordinal integers

(for a multi-dimensional feature matrix)

from sklearn.preprocessing import OrdinalEncoder
#The categories_ attribute corresponds to LabelEncoder's classes_ attribute and does exactly the same thing
data_ = data.copy()
data_.head()
OrdinalEncoder().fit(data_.iloc[:,1:-1]).categories_
data_.iloc[:,1:-1] = OrdinalEncoder().fit_transform(data_.iloc[:,1:-1])
data_.head()

preprocessing.OneHotEncoder: One-hot encoding, creating dummy variables

It is suitable for nominal variables, which have no ordinal or computable meaning; one-hot encoding converts them into dummy variables.

data.head()
from sklearn.preprocessing import OneHotEncoder
X = data.iloc[:,1:-1]
enc = OneHotEncoder(categories='auto').fit(X)
result = enc.transform(X).toarray()
result
#This could still be done in one step, but it is written out in three steps here to show the model attributes
OneHotEncoder(categories='auto').fit_transform(X).toarray()
#The encoding can also be reversed
pd.DataFrame(enc.inverse_transform(result))
enc.get_feature_names()
result
result.shape
#axis=1 concatenates across columns, i.e. joins the tables side by side; axis=0 would stack them on top of each other
newdata = pd.concat([data,pd.DataFrame(result)],axis=1)
newdata.head()
newdata.drop(["Sex","Embarked"],axis=1,inplace=True)
newdata.columns = ["Age","Survived","Female","Male","Embarked_C","Embarked_Q","Embarked_S"]

2. Feature Engineering


2.1 Filter method


2.1.1 Variance filtering

Variance filtering screens features by their own variance. For example, if a feature's variance is very small, the samples are essentially identical on that feature: most of the values in the feature are probably the same, or even every value is the same, so the feature tells us nothing about how samples differ. Therefore, whatever feature engineering we plan to do next, the features with zero variance should be eliminated first. VarianceThreshold has an important parameter, threshold, the variance threshold: all features whose variance is below threshold are discarded. If it is not set, the default is 0, meaning that only features whose value is identical in every record are removed.

from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold() #instantiate; without arguments the default variance threshold is 0
X_var0 = selector.fit_transform(X) #get the new feature matrix with the unqualified features removed
#This can also be written directly as X = VarianceThreshold().fit_transform(X)

It can be seen that we have removed the features with a variance of 0, but 708 features still remain, so further feature selection is clearly needed. However, if we know how many features we need, variance can also help us select features in one step. For example, if we want to keep half of the features, we can set a variance threshold that halves the number of features: just take the median of the features' variances and pass that median as the value of the threshold parameter:

import numpy as np
X_fsvar = VarianceThreshold(np.median(X.var().values)).fit_transform(X)
X.var().values
np.median(X.var().values)
X_fsvar.shape

2.1.2 Relevance filtering

After filtering by variance, we have to consider the next issue: correlation. We want to select features that are relevant and meaningful to the label, because such features carry a lot of information. If a feature has nothing to do with the label, it only wastes computation and memory and may also introduce noise into the model. In sklearn, we have three commonly used methods to judge the correlation between features and labels: chi-square, F-test, and mutual information.

2.1.2.1 Chi-square filter

Chi-square filtering is a correlation filter designed specifically for discrete labels, i.e. classification problems.

from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
#Suppose here we already know that we need 300 features
X_fschi = SelectKBest(chi2, k=300).fit_transform(X_fsvar, y)
X_fschi.shape
How to choose the value of k: look at the p-values

The essence of the chi-square test is to infer whether two sets of data differ; its null hypothesis is that "the two sets of data are independent of each other". The chi-square test returns two statistics, the chi-square value and the p-value. The chi-square value is difficult to interpret on its own, whereas for the p-value we generally use 0.01 or 0.05 as the significance level, i.e. the boundary for the decision:

p-value <= 0.05 or 0.01: the two sets of data are related - reject the null hypothesis, accept the alternative hypothesis.
p-value > 0.05 or 0.01: the two sets of data are independent - accept the null hypothesis.

From the perspective of feature engineering, we want to keep features with a large chi-square value and a p-value below 0.05, that is, features that are associated with the label.
Before calling SelectKBest, we can obtain the chi-square value and p-value of each feature directly from chi2.

chivalue, pvalues_chi = chi2(X_fsvar,y)
chivalue
pvalues_chi
#How large should k be? We want to remove all features whose p-value is above the chosen level, e.g. 0.05 or 0.01:
k = chivalue.shape[0] - (pvalues_chi > 0.05).sum()
#X_fschi = SelectKBest(chi2, k=the chosen k).fit_transform(X_fsvar, y)
#cross_val_score(RFC(n_estimators=10,random_state=0),X_fschi,y,cv=5).mean()

It can be observed that the p-value of every feature is 0, which means that for the digit recognizer dataset, variance filtering has already eliminated all features unrelated to the label, or the dataset itself contains no features unrelated to the label. In this case, discarding any feature would discard information that is useful to the model and lower its performance. So as long as we are satisfied with the computation speed, we do not need correlation filtering on this data; if the computation is too slow, we can delete some features as appropriate, but only at the cost of the model's performance. Next, let's try other correlation filtering methods to verify this conclusion on the same dataset.
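
To check this conclusion empirically before moving on, one option (expanding the commented-out lines above) is to compare cross-validated scores with and without the chi-square filtering:

from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import cross_val_score

#score of a small random forest on the variance-filtered features
cross_val_score(RFC(n_estimators=10,random_state=0),X_fsvar,y,cv=5).mean()
#score on the chi-square-filtered features; a clear drop would mean the filtering removed useful information
cross_val_score(RFC(n_estimators=10,random_state=0),X_fschi,y,cv=5).mean()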

2.1.2.2 F-test

The F-test, also known as ANOVA (analysis of variance), is a filtering method used to capture the linear relationship between each feature and the label. It can be used for both regression and classification, so it provides two classes: feature_selection.f_classif (F-test for classification) and feature_selection.f_regression (F-test for regression). F-test classification is used when the label is a discrete variable, and F-test regression when the label is a continuous variable.

from sklearn.feature_selection import f_classif
F, pvalues_f = f_classif(X_fsvar,y)
F
pvalues_f
k = F.shape[0] - (pvalues_f > 0.05).sum()
#X_fsF = SelectKBest(f_classif, k=the chosen k).fit_transform(X_fsvar, y)
#cross_val_score(RFC(n_estimators=10,random_state=0),X_fsF,y,cv=5).mean()

The conclusion is exactly the same as with chi-square filtering: no feature has a p-value greater than 0.01, all features are related to the label, so we do not need correlation filtering.

2.1.2.3 Mutual information

The mutual information method is a filtering method used to capture any relationship (both linear and non-linear) between each feature and the label. Like the F-test, it can be used for both regression and classification, and it provides two classes: feature_selection.mutual_info_classif (mutual information for classification) and feature_selection.mutual_info_regression (mutual information for regression). The usage and parameters of these two classes are exactly the same as for the F-test, but the mutual information method is more powerful: the F-test can only find linear relationships, while mutual information can find relationships of any kind. The mutual information method does not return a statistic like the p-value or F-value; it returns an estimate of the amount of mutual information between each feature and the target. This estimate lies in [0,1]: 0 means the two variables are independent, and 1 means they are completely dependent. Taking mutual information classification as an example, the code is as follows:

from sklearn.feature_selection import mutual_info_classif as MIC
result = MIC(X_fsvar,y)
k = result.shape[0] - sum(result <= 0)
#X_fsmic = SelectKBest(MIC, k=the chosen k).fit_transform(X_fsvar, y)
#cross_val_score(RFC(n_estimators=10,random_state=0),X_fsmic,y,cv=5).mean()

The mutual information estimate of every feature is greater than 0, so all features are related to the label.

2.2 Embedding method

The embedded method lets the algorithm itself decide which features to use, that is, feature selection and algorithm training are carried out at the same time; see the sketch below.
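
In sklearn this idea is implemented by feature_selection.SelectFromModel. A minimal sketch with a random forest as the estimator (the threshold of 0.005 is only an illustrative value):

from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.feature_selection import SelectFromModel

RFC_ = RFC(n_estimators=10,random_state=0) #the model that supplies feature_importances_
#keep only the features whose importance is above the (illustrative) threshold
X_embedded = SelectFromModel(RFC_,threshold=0.005).fit_transform(X_fsvar,y)
X_embedded.shape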

2.3 Wrapper packaging method

The wrapper method also performs feature selection and algorithm training at the same time. It is very similar to the embedded method in that it also relies on the algorithm itself, for example the coef_ attribute or the feature_importances_ attribute, to complete feature selection. The difference is that an objective function is used as a black box to help us select features, instead of us supplying an evaluation metric or a statistic's threshold; see the sketch below.
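
The typical wrapper in sklearn is feature_selection.RFE (recursive feature elimination). A minimal sketch (the values of n_features_to_select and step are only illustrative):

from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.feature_selection import RFE

RFC_ = RFC(n_estimators=10,random_state=0)
#RFE repeatedly fits the model, drops the `step` least important features and refits,
#until only n_features_to_select features remain
selector = RFE(RFC_, n_features_to_select=340, step=50).fit(X_fsvar, y)
selector.support_.sum() #boolean mask of the selected features; sum() counts them
selector.ranking_ #ranking of all features (1 means selected)
X_wrapper = selector.transform(X_fsvar)
X_wrapper.shape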


Origin blog.csdn.net/qq_45617555/article/details/112308688