This article mainly contains two parts. The first part is the transformation of feature data, which is especially useful for models with low complexity, such as linear models. The second part is feature selection.
For a specific application, how to find the best data representation is a problem called feature engineering, which is one of the main tasks of data scientists and machine learning practitioners when trying to solve real-world problems.
Feature type | Alias |
---|---|
Numerical feature | Continuous feature |
Categorical feature | Discrete feature |
Representing the data in a correct way has a greater impact on the performance of the supervised model than the precise parameters selected.
1. Categorical variables
1.1 One-Hot encoding (dummy variable)
Encoding | Alias 1 | Alias 2 |
---|---|---|
One-hot encoding | One-out-of-N encoding | Dummy variable |
The idea behind dummy variables is to replace a categorical variable with one or more new features, and the new features take values of 0 and 1.
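As a minimal sketch of this idea, the snippet below one-hot encodes a single string column with pandas. The `workclass` column and its values are hypothetical toy data chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with three distinct values.
df = pd.DataFrame({"workclass": ["Private", "State-gov", "Private", "Self-emp"]})

# Each distinct value becomes its own 0/1 column; exactly one of them is 1 per row.
dummies = pd.get_dummies(df, columns=["workclass"], dtype=int)
print(dummies)
```

Each row of the result has a single 1, in the column that corresponds to its original category.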
1.2 How to deal with numerically coded categorical variables?
The get_dummies function of pandas treats all numbers as continuous and does not create dummy variables for them.
To solve this problem, you can use scikit-learn's OneHotEncoder to specify which variables are continuous and which variables are discrete. You can also convert the numeric columns in the data frame into strings.
- Method 1: Use pd.get_dummies() and explicitly list the columns to be encoded
import pandas as pd
demo_df = pd.DataFrame({
    'Integer Feature': [0, 1, 2, 1],
    'Categorical Feature': ['socks', 'fox', 'socks', 'box']})
# Convert the integer column to strings so that it is treated as categorical
demo_df['Integer Feature'] = demo_df['Integer Feature'].astype(str)
# Explicitly list the columns to encode
pd.get_dummies(demo_df, columns=['Integer Feature', 'Categorical Feature'])
- Method 2: Use OneHotEncoder class to encode
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # this argument was named sparse in scikit-learn versions before 1.2
encoder.fit(demo_df)
encoder.transform(demo_df)
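One way to tell scikit-learn which columns are continuous and which are categorical is to wrap OneHotEncoder in a ColumnTransformer. This is not shown in the snippet above; it is a sketch under assumptions, and the column names `age` and `zone` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: "age" is continuous, "zone" is an integer-coded category.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "zone": [0, 1, 2, 1],
})

ct = ColumnTransformer(
    [
        ("scale", StandardScaler(), ["age"]),   # continuous: rescale only
        ("onehot", OneHotEncoder(), ["zone"]),  # categorical: one 0/1 column per code
    ],
    sparse_threshold=0,  # always return a dense array
)
X_t = ct.fit_transform(df)
print(X_t.shape)  # (4, 4): 1 scaled column + 3 one-hot columns
```

The integer codes in `zone` are never treated as magnitudes here, because only the one-hot columns reach the model.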
2. Binning (discretization): mainly for linear models
Feature binning (also called discretization) splits a continuous feature into multiple features, one per bin.
bins = np.linspace(-3, 3, 11) # 11 evenly spaced edges on [-3, 3], which define 10 bins
which_bin = np.digitize(X, bins=bins) # array with the same shape as X holding the bin index of each value; bins are counted from 1
Summary:
- Binning usually does not improve tree-based models, because these models can learn to split the data at any location. In a sense, decision trees learn whichever binning is most useful for predicting the data.
- For a specific data set, if there are good reasons to use a linear model (for example, the data set is large and high-dimensional, but some features have a nonlinear relationship with the output), then binning is a good way to increase modeling power.
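The effect on a linear model can be sketched as follows. The data set is hypothetical (one feature with a noisy sine target), and the scores are computed on the training data only, so this illustrates the added flexibility rather than generalization.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one feature in [-3, 3) with a nonlinear (sine) target.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

bins = np.linspace(-3, 3, 11)          # 11 edges define 10 bins
which_bin = np.digitize(X, bins=bins)  # bin index of each sample, counted from 1

# One-hot encode the bin index: one 0/1 column per bin.
X_binned = np.eye(10)[which_bin.ravel() - 1]

# Training-set R^2 of a linear model on the raw feature vs. the binned features.
score_raw = LinearRegression().fit(X, y).score(X, y)
score_binned = LinearRegression().fit(X_binned, y).score(X_binned, y)
print(f"raw: {score_raw:.2f}, binned: {score_binned:.2f}")
```

On the binned features the linear model predicts a separate constant per bin, which lets it follow the curve that a single straight line cannot.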
3. Interaction features and polynomial features
To enrich the feature representation, especially for linear models, you can add interaction features and polynomial features of the original data.
- Interaction features
X_combined = np.hstack([X, X_binned]) # concatenate the original feature and the one-hot binned features column-wise as the input features
X_product = np.hstack([X_binned, X * X_binned])
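To make the product features concrete, here is a small sketch with four hand-picked values and three bins (both are assumptions for illustration). Multiplying `X` by the one-hot bin indicators gives each bin its own copy of the original feature, so a linear model can learn a separate slope per bin.

```python
import numpy as np

# Hypothetical data: four samples of a single feature.
X = np.array([[-2.5], [-0.2], [0.4], [2.9]])

bins = np.linspace(-3, 3, 4)                 # edges -3, -1, 1, 3 define 3 bins
which_bin = np.digitize(X, bins=bins)        # bin index per sample, counted from 1
X_binned = np.eye(3)[which_bin.ravel() - 1]  # one-hot bin indicators

# Bin indicators plus (indicator * original value): an offset and a slope per bin.
X_product = np.hstack([X_binned, X * X_binned])
print(X_product)
```

In each row, only the indicator and the product column of the sample's own bin are non-zero.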
- Polynomial features: implemented in PolynomialFeatures of the preprocessing module
from sklearn.preprocessing import PolynomialFeatures
# The default "include_bias=True" adds a constant feature that is always 1
# degree applies to all features
poly = PolynomialFeatures(degree=10, include_bias=False)
X = np.array([1,2]).reshape(-1,1)
poly.fit(X)
X_poly = poly.transform(X)
The meaning of each output feature can be obtained by calling the get_feature_names_out method (get_feature_names in scikit-learn versions before 1.0), which gives the exponent of each input feature:
poly.get_feature_names_out()
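A two-feature sketch shows what the expansion produces. The input values are arbitrary, and the `powers_` attribute is used here as another way to inspect the exponents behind each output column.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features: x0 = 2, x1 = 3.
X = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(X_poly)        # [[2. 3. 4. 6. 9.]] -> x0, x1, x0^2, x0*x1, x1^2
print(poly.powers_)  # one row per output column: exponents of (x0, x1)
```

Because the degree applies to all features together, the output includes the interaction term x0*x1 as well as the pure powers.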
4. Univariate nonlinear transformation
Most models perform best when each feature (including the target value in the regression problem) roughly follows a Gaussian distribution.
Although tree-based models only focus on the order of features, linear models and neural networks rely on the scale and distribution of each feature.
For regression problems, the log and exp functions can help adjust the relative scales in the data, so that a linear model or neural network can capture them better.
The sin and cos functions are very useful when dealing with data with periodic patterns.
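As a sketch of the periodic case, an hour-of-day feature can be mapped onto a circle with sin and cos. The hour values below are hypothetical, and the 24-hour period is an assumption of the example.

```python
import numpy as np

# Hypothetical periodic feature: hour of day.
hours = np.array([0, 6, 12, 18, 23])

# Map each hour to a point on the unit circle (period of 24 hours).
angle = 2 * np.pi * hours / 24
X_cyclic = np.column_stack([np.sin(angle), np.cos(angle)])

# Distances in the encoded space respect the cycle: 23:00 is close to 00:00.
dist_23_to_0 = np.linalg.norm(X_cyclic[4] - X_cyclic[0])
dist_6_to_0 = np.linalg.norm(X_cyclic[1] - X_cyclic[0])
print(dist_23_to_0, dist_6_to_0)
```

With the raw hour value, 23 and 0 would look far apart to a linear model or neural network; in the encoded space they are neighbors.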
The kind of value distribution shown in the figure above (many small values and a few very large ones) is very common in practice. Such data is generally transformed as follows:
X_train_log = np.log(X_train + 1)
X_test_log = np.log(X_test + 1)
After the transformation, the asymmetry of the data distribution becomes smaller, and there are no longer very large outliers.
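The reduction in asymmetry can be checked numerically. The sketch below uses hypothetical count data (Poisson counts with a widely varying rate) and a hand-written `skewness` helper, both introduced only for this example.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical count data: many small values and a few very large ones.
X_train = rng.poisson(10 * np.exp(rng.normal(size=(1000, 3))))
X_train_log = np.log(X_train + 1)

def skewness(a):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    a = a.ravel().astype(float)
    return np.mean((a - a.mean()) ** 3) / a.std() ** 3

print("skewness before:", skewness(X_train))
print("skewness after: ", skewness(X_train_log))
```

A skewness close to zero indicates a roughly symmetric distribution, which is what the log transform aims for on data like this.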
Part 1 summary
- Binning, polynomials, and interaction terms all have a great impact on the performance of the model on a given data set, especially for models with lower complexity, such as linear models and naive Bayes models.
- Tree-based models are usually able to discover important interaction terms by themselves, and in most cases there is no need to explicitly transform the data.
- Other models, such as SVM, nearest neighbors, and neural networks, may sometimes benefit from using binning, interaction terms, or polynomials, but the effect is usually not as pronounced as for linear models.
5. Automated feature selection (sklearn.feature_selection is mainly used)
When adding new features or processing general high-dimensional data sets, it is best to reduce the number of features to only include the most useful features, and delete the remaining features. This will result in a simpler model with better generalization capabilities.
Basic strategies:
- Univariate statistics
- Model-based selection
- Iterative selection
Function to visualize the mask (feature selection result): plt.matshow(mask.reshape(1,-1), cmap='gray_r')
5.1 Univariate Statistics
In univariate statistics, we calculate whether the relationship between each feature and the target value is statistically significant, and then select the features with the highest confidence.
For classification problems, this is also known as analysis of variance (ANOVA). A key property of these tests is that they are univariate, meaning that they only consider each feature individually.
The methods differ in how they set the threshold for keeping features:
Method | Description |
---|---|
SelectKBest | Select a fixed number of k features |
SelectPercentile | Select a fixed percentage of features |
- Import and instantiation of feature selection
from sklearn.feature_selection import SelectPercentile
select = SelectPercentile(percentile=50)
select.fit(X_train, y_train)
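A fuller sketch of this workflow is shown below. The data set is an assumption made for illustration: scikit-learn's bundled breast cancer data with 50 pure-noise columns appended, so that there is something for the selector to discard.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.model_selection import train_test_split

# Assumed setup: 30 real features followed by 50 pure-noise features.
cancer = load_breast_cancer()
rng = np.random.RandomState(42)
noise = rng.normal(size=(len(cancer.data), 50))
X_w_noise = np.hstack([cancer.data, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X_w_noise, cancer.target, random_state=0, test_size=0.5)

# Keep the 50% of features with the strongest univariate relationship to y.
select = SelectPercentile(percentile=50)
select.fit(X_train, y_train)
X_train_selected = select.transform(X_train)

mask = select.get_support()  # boolean array: True for each selected feature
print(X_train.shape, X_train_selected.shape)
print("real features kept: ", mask[:30].sum())
print("noise features kept:", mask[30:].sum())
```

The `mask` array is what the `plt.matshow` call mentioned earlier visualizes.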
5.2 Model-based feature selection
Model-based feature selection uses a supervised machine learning model to determine the importance of each feature, and only retains the most important features.
The supervised model used for feature selection does not need to be the same as the model used for final supervised modeling.
The feature selection model needs to provide a certain importance metric for each feature in order to use this metric to rank the features.
- Import and instantiation of feature selection
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
select = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42), threshold="median") # a random forest classifier with 100 trees computes the feature importances; the threshold parameter sets the cutoff for keeping a feature
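The selector can then be fitted and applied like any other transformer. The sketch below assumes a synthetic data set from `make_classification`, where, with `shuffle=False`, the first 5 columns are informative and the remaining 15 are noise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Hypothetical synthetic data: columns 0-4 are informative, columns 5-19 are noise.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)

select = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median")  # keep features whose importance is at least the median
select.fit(X, y)
X_selected = select.transform(X)

mask = select.get_support()
importances = select.estimator_.feature_importances_
print(X_selected.shape)
print("mean importance, informative:", importances[:5].mean())
print("mean importance, noise:      ", importances[5:].mean())
```

Unlike univariate statistics, the random forest considers all features at once, so it can also pick up interactions between them.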
5.3 Iterative feature selection
In iterative feature selection, a series of models will be constructed, each model using a different number of features. There are two basic methods:
- Start with no features, then add features one by one until a termination condition is met;
- Start with all features, then delete features one by one until a termination condition is met.
Method: recursive feature elimination (RFE)
How it works: RFE builds a model on all features, discards the least important feature according to that model, builds a new model on the remaining features, and repeats until only a preset number of features is left.
- Import and instantiation of feature selection
from sklearn.feature_selection import RFE
select = RFE(RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=40)
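A runnable sketch of RFE follows. It assumes a small synthetic data set and keeps 10 of 20 features instead of 40, so that the repeated model fits stay fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Hypothetical synthetic data: columns 0-4 are informative, columns 5-19 are noise.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)

# Drop one feature per round until 10 remain; a new forest is fitted in each round.
select = RFE(RandomForestClassifier(n_estimators=100, random_state=42),
             n_features_to_select=10)
select.fit(X, y)
X_rfe = select.transform(X)

print(X_rfe.shape)      # (300, 10)
print(select.ranking_)  # 1 = kept; larger ranks were eliminated earlier
```

Because a model is refitted after every elimination, RFE costs considerably more than univariate or single-model selection.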
Part 2 summary
- When you are unsure which features to use as input to a machine learning algorithm, automated feature selection can be particularly useful.
- Feature selection can help reduce the number of required features, speed up prediction, or allow for more interpretable models. In most real-world cases it is unlikely to improve performance significantly, but it is still a very valuable tool in the feature engineering toolbox.