Data representation and feature engineering with the sklearn library

This article has two parts. The first covers transformations of feature data, which are especially useful for low-complexity models such as linear models. The second covers feature selection.


Finding the best representation of the data for a specific application is known as feature engineering; it is one of the main tasks of data scientists and machine learning practitioners trying to solve real-world problems.

Feature type           Alias
numerical feature      continuous feature
categorical feature    discrete feature

Representing the data in the right way often has a greater impact on the performance of a supervised model than the exact parameters you choose.

1. Categorical variables

1.1 One-hot encoding (dummy variables)

One-hot encoding is also known as one-out-of-N encoding or dummy variables.

The idea behind dummy variables is to replace a categorical variable with one or more new features that take the values 0 and 1.
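As a minimal illustration (not from the original article; the column name and values below are made up), pd.get_dummies() turns a single string-valued column into one indicator column per category:

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'red', 'blue']})  # hypothetical toy data
print(pd.get_dummies(df))  # produces color_blue, color_green, color_red columns (0/1, or True/False in recent pandas versions)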

1.2 How to deal with numerically coded categorical variables?

The get_dummies function of pandas treats all numbers as continuous and does not create dummy variables for them.

To solve this problem, you can use scikit-learn's OneHotEncoder and specify which variables are continuous and which are discrete, or you can convert the numeric columns in the DataFrame to strings.

  • Method 1: Use pd.get_dummies() and explicitly list the columns to encode
import pandas as pd

demo_df = pd.DataFrame({
    'Integer Feature': [0, 1, 2, 1],
    'Categorical Feature': ['socks', 'fox', 'socks', 'box']})
demo_df['Integer Feature'] = demo_df['Integer Feature'].astype(str)  # treat the integer column as strings so that get_dummies encodes it
pd.get_dummies(demo_df, columns=['Integer Feature', 'Categorical Feature'])  # explicitly list the columns to encode
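If run as written, this should produce a DataFrame with six indicator columns: Integer Feature_0, Integer Feature_1, Integer Feature_2, Categorical Feature_box, Categorical Feature_fox, and Categorical Feature_socks (the original article showed this result as an image).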


  • Method 2: Use the OneHotEncoder class for encoding
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse=False)  # return a dense array instead of a sparse matrix (the parameter is called sparse_output in recent scikit-learn versions)
encoder.fit(demo_df)
encoder.transform(demo_df)  # applied to the whole DataFrame, this encodes every column, including the integer one
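As a small follow-up (not from the original article): for this four-row example the transform returns a dense array with six columns, one per distinct value in each column.

print(encoder.transform(demo_df).shape)  # expected: (4, 6)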

2. Binning (discretization): mainly for linear models

Feature binning (also called discretization) can be used to divide a continuous feature into multiple features.

import numpy as np

bins = np.linspace(-3, 3, 11)  # 11 edge points, i.e. 10 equal-width bins on [-3, 3]
which_bin = np.digitize(X, bins=bins)  # array of the same shape as X giving the bin each value falls into, counted from 1
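As a minimal end-to-end sketch (synthetic data and variable names of my own choosing, not from the original article), the bin indices are typically one-hot encoded and then passed to a linear model, which can then fit a separate constant for each bin:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=(100, 1))            # hypothetical 1-D input
y = np.sin(4 * X[:, 0]) + X[:, 0] + rnd.normal(size=100)

bins = np.linspace(-3, 3, 11)
which_bin = np.digitize(X, bins=bins)            # bin index (1 to 10) for each sample, shape (100, 1)
X_binned = OneHotEncoder().fit_transform(which_bin).toarray()  # one 0/1 column per occupied bin

reg = LinearRegression().fit(X_binned, y)        # the linear model now fits one constant per bin
print(reg.score(X_binned, y))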
  • Summary
  1. Binning usually does not improve tree-based models, because such models can learn to split the data at any location; in a sense, a decision tree can learn whatever binning is most useful for predicting the data.

  2. If there are good reasons to use a linear model on a particular data set (for example, the data set is large and high-dimensional, but the relationship between some features and the output is nonlinear), then binning is a good way to increase modeling power.


3. Interaction features and polynomial features

To enrich the feature representation, especially for linear models, you can add interaction features and polynomial features of the original data.

  • Interaction features
X_combined = np.hstack([X, X_binned])  # horizontally stack the original feature and the one-hot-encoded binned feature as input features
X_product = np.hstack([X_binned, X * X_binned])  # the product features let a linear model learn a separate slope within each bin
  • Polynomial features: implemented in PolynomialFeatures of the preprocessing module
from sklearn.preprocessing import PolynomialFeatures
# the default include_bias=True would add a constant feature that always equals 1
# degree applies to all input features
poly = PolynomialFeatures(degree=10, include_bias=False)
X = np.array([1, 2]).reshape(-1, 1)  # tiny example input with a single feature
poly.fit(X)
X_poly = poly.transform(X)  # powers of x up to x**10, i.e. 10 output features

The meaning of each output feature can be obtained by calling the get_feature_names method, which gives the name corresponding to each feature index.

poly.get_feature_names()
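For the two-sample, single-feature X above, this should return ten names: ['x0', 'x0^2', 'x0^3', 'x0^4', 'x0^5', 'x0^6', 'x0^7', 'x0^8', 'x0^9', 'x0^10'] (the original article showed this output as an image). In recent scikit-learn versions the same information is available from get_feature_names_out().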



4. Univariate nonlinear transformations

Most models perform best when each feature (including the target value in the regression problem) roughly follows a Gaussian distribution.

Although tree-based models only focus on the order of features, linear models and neural networks rely on the scale and distribution of each feature.

In regression problems, the log and exp functions can help adjust the relative scale of the data, thereby improving the learning of linear models and neural networks.

The sin and cos functions are very useful when dealing with data with periodic patterns.
This type of value distribution (many small values and a few very large values) is very common in practice. Such a data set is typically transformed as follows:

X_train_log = np.log(X_train + 1)  # the +1 avoids taking the log of 0 for zero-valued entries
X_test_log = np.log(X_test + 1)

After the transformation, the asymmetry of the data distribution becomes smaller, and there are no longer very large outliers.
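A self-contained sketch with synthetic count-like data (my own construction following this pattern, not code from the original article) illustrates the effect on a linear model:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical data whose features are skewed integer counts (many small values, a few large ones)
rnd = np.random.RandomState(0)
X_org = rnd.normal(size=(1000, 3))
w = rnd.normal(size=3)
X = rnd.poisson(10 * np.exp(X_org))
y = np.dot(X_org, w)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
score_raw = Ridge().fit(X_train, y_train).score(X_test, y_test)

X_train_log = np.log(X_train + 1)
X_test_log = np.log(X_test + 1)
score_log = Ridge().fit(X_train_log, y_train).score(X_test_log, y_test)
print(score_raw, score_log)  # the log-transformed features usually give the clearly better score here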


Part 1 summary

  • Binning, polynomials, and interaction features can all have a large impact on how a model performs on a given data set, especially for lower-complexity models such as linear models and naive Bayes models.

  • Tree-based models are usually able to discover important interaction terms by themselves, and in most cases there is no need to explicitly transform the data.

  • Other models, such as SVMs, nearest neighbors, and neural networks, may sometimes benefit from binning, interaction features, or polynomials, but the effect is usually not as pronounced as it is for linear models.


5. Automated feature selection (mainly using sklearn.feature_selection)

When adding new features or working with high-dimensional data sets in general, it is often best to reduce the number of features to only the most useful ones and discard the rest. This results in a simpler model that generalizes better.

Three basic strategies:

  • Univariate statistics
  • Model-based selection
  • Iterative selection

To visualize the mask (the Boolean array of selected features, available from the selector's get_support() method): plt.matshow(mask.reshape(1, -1), cmap='gray_r')


5.1 Univariate Statistics

In univariate statistics, we compute whether there is a statistically significant relationship between each feature and the target, and then select the features that are related with the highest confidence.

For classification problems, this is also known as analysis of variance (ANOVA). A key property of these tests is that they are univariate, meaning that they only consider each feature individually.

Methods for computing the selection threshold (i.e. how many features to keep):

Method             Description
SelectKBest        selects a fixed number k of features
SelectPercentile   selects a fixed percentage of features
  • Import and instantiation of feature selection
from sklearn.feature_selection import SelectPercentile
select = SelectPercentile(percentile=50)  # keep the 50% of features with the highest scores (ANOVA F-test by default)
select.fit(X_train, y_train)
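A brief follow-up sketch (assuming the X_train used above and that matplotlib is available; not from the original article) shows how the selection is applied and how the mask mentioned earlier can be visualized:

X_train_selected = select.transform(X_train)  # keep only the selected 50% of the features
mask = select.get_support()                   # Boolean mask, True for the selected features

import matplotlib.pyplot as plt
plt.matshow(mask.reshape(1, -1), cmap='gray_r')  # black squares mark the selected features
plt.xlabel("Feature index")
plt.show()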

5.2 Model-based feature selection

Model-based feature selection uses a supervised machine learning model to determine the importance of each feature, and only retains the most important features.

The supervised model used for feature selection does not need to be the same as the model used for final supervised modeling.

The model used for feature selection needs to provide some measure of importance for each feature, so that the features can be ranked by this measure.

  • Import and instantiation of feature selection
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
# A random forest with 100 trees computes the feature importances;
# threshold="median" keeps the features whose importance exceeds the median, i.e. about half of them.
select = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42),
                         threshold="median")
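A brief usage sketch (again assuming an existing X_train and y_train; not from the original article):

select.fit(X_train, y_train)             # fits the random forest and records the feature importances
X_train_rf = select.transform(X_train)   # keeps roughly half of the features (those above the median importance)
print(X_train.shape, X_train_rf.shape)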

5.3 Iterative feature selection

In iterative feature selection, a series of models is built, each using a different number of features. There are two basic approaches:

  • Start with no features and add them one by one until some stopping criterion is reached;

  • Start with all features and remove them one by one until some stopping criterion is reached.

A typical method: recursive feature elimination (RFE)

How it works: the model is first built with all features and the least important feature is discarded according to the model; a new model is then built with the remaining features, and so on, until only a preset number of features is left.

  • Import and instantiation of feature selection
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
select = RFE(RandomForestClassifier(n_estimators=100, random_state=42),
             n_features_to_select=40)  # drop the least important feature one at a time until 40 remain
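And a brief usage sketch for RFE (same assumptions as above; note that fitting can be slow, since roughly one forest is trained per eliminated feature):

select.fit(X_train, y_train)
X_train_rfe = select.transform(X_train)  # keeps the 40 surviving features (assuming the data has more than 40 to begin with)
mask = select.get_support()              # Boolean mask of the selected features, as before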

Part 2 summary

  1. If you are unsure which features to use as input to a machine learning algorithm, automated feature selection can be particularly useful.

  2. Feature selection can help reduce the number of required features, speed up prediction, or allow more interpretable models. In most real situations, it is unlikely that using feature selection will significantly improve performance, but it is still a very valuable tool in the feature engineering toolbox.


Origin blog.csdn.net/xylbill97/article/details/105947877