Feature Engineering (Part 1)

Feature selection (feature_selection)

Filter

  1. Removing features with low variance
  2. Univariate feature selection

Wrapper

  1. Recursive feature elimination

Embedded

  1. Feature selection using SelectFromModel
  2. Feature selection as part of a pipeline

Once data preprocessing is complete, we need to select meaningful features to feed into the machine learning algorithms and models for training.

Generally speaking, features are selected from two perspectives:

Whether the feature diverges

If a feature does not diverge, for example its variance is close to zero, then the samples show essentially no difference on this feature, and it is of no use for distinguishing between samples.

Correlation with the target

Obviously, features that are highly correlated with the target should be preferred. Apart from the low-variance removal method, the other methods described in this article select features from the perspective of correlation.

Based on how the selection is performed, feature selection methods can be divided into three kinds:

  • Filter: filter methods score each feature according to its divergence or its correlation with the target, then keep the features that pass a score threshold or a threshold on the number of features to select.
  • Wrapper: wrapper methods repeatedly add or remove a few features at a time according to an objective function (usually a predictive performance score).
  • Embedded: embedded methods first train a machine learning model to obtain a weight coefficient for each feature, then select features in descending order of those coefficients. This is similar to the filter methods, except that the merit of each feature is determined by model training.

Feature selection has two main purposes:

  • Reducing the number of features (dimensionality reduction) to improve the model's generalization ability and reduce overfitting;

  • Improving our understanding of the features and of the relationship between the features and their values.

Given a data set, a single round of feature selection can rarely achieve both goals at once. Typically, we pick the method we are most familiar with, or the most convenient one (often aiming at dimensionality reduction while ignoring the goal of understanding the features and the data). The following sections walk through several common feature selection methods in scikit-learn, together with their advantages, disadvantages, and caveats.

Filter

1) Removing features with low variance

Suppose a feature takes only the values 0 and 1, and in 95% of all input samples its value is 1; then this feature has little effect. If its value is 1 in 100% of the samples, the feature is meaningless. This method can only be applied directly when the feature values are discrete; continuous variables must be discretized first. In practice, it is rare for more than 95% of samples to share one feature value, so the method is simple but not very useful on its own. It can serve as a pre-selection step: first remove the features whose values hardly change, and then apply one of the more suitable feature selection methods described below.

from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
# Features whose variance is below this threshold will be removed
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

As expected, VarianceThreshold removed the first column, in which the value 0 appears with probability 5/6.
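To double-check which column was dropped, you can inspect the fitted selector (a quick sketch reusing the sel object above):

print(sel.variances_)     # per-feature variances computed by the selector
print(sel.get_support())  # boolean mask of the features that were kept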

2) Univariate feature selection

The principle of univariate feature selection is to compute a statistical metric for each variable separately, and then use that metric to decide which variables are important and eliminate the unimportant ones.

For classification problems (discrete y), you can use:

  • Chi-square test
  • f_classif
  • mutual_info_classif
  • Mutual information

For regression problems (continuous y), you can use:

  • Pearson correlation coefficient
  • f_regression
  • mutual_info_regression
  • Maximal information coefficient

These methods are relatively simple, easy to run, and easy to understand, and they generally work well for gaining an understanding of the data (although they are not necessarily effective for feature optimization or for improving generalization ability).

  • SelectKBest keeps the top k highest-scoring features and removes all the others.
  • SelectPercentile keeps the top k% highest-scoring features, with the percentage specified by the user.
  • Generic univariate statistical tests for each feature: false positive rate SelectFpr, false discovery rate SelectFdr, or family-wise error rate SelectFwe.
  • GenericUnivariateSelect performs univariate feature selection with a configurable strategy. The selection strategy can also be tuned with hyperparameter search to find the best univariate selection strategy (see the sketch after this list).
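A quick sketch of how two of these selectors are called (loading the iris data directly so the snippet is self-contained; the 50% cutoff is an arbitrary illustrative choice):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, GenericUnivariateSelect, f_classif

X, y = load_iris(return_X_y=True)
# Keep the top 50% of features by ANOVA F-score
X_top = SelectPercentile(f_classif, percentile=50).fit_transform(X, y)
# The same selection expressed through the configurable selector
X_gus = GenericUnivariateSelect(f_classif, mode='percentile', param=50).fit_transform(X, y)
print(X_top.shape, X_gus.shape)  # both keep 2 of the 4 iris features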

Notice:
  The methods based on the F-test estimate the degree of linear dependency between two random variables. Mutual information methods, on the other hand, can capture any kind of statistical dependency, but, being non-parametric, they require more samples for accurate estimation.
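A small sketch that makes this difference concrete: a purely quadratic relationship that the F-test rates close to zero but mutual information clearly detects (variable names and sample size are illustrative).

import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, (1000, 1))
y = x[:, 0] ** 2 + 0.05 * rng.normal(size=1000)  # nonlinear, non-monotonic dependence

F, _ = f_regression(x, y)
mi = mutual_info_regression(x, y)
print(F, mi)  # the F-score is near zero while the mutual information is clearly positive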

Chi-square (Chi2) test

The classic chi-square test examines the correlation between a categorical independent variable and a categorical dependent variable. For example, we can run a chi2 test on the samples to select the best two features:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
X.shape
(150, 4)
X[:5, :]
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape
(150, 2)
X_new[:5, :]
array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2]])
  • f_classif
    ANOVA F-value between label/feature for classification tasks.

  • mutual_info_classif
    Mutual information for a discrete target.

  • chi2
    Chi-squared stats of non-negative features for classification tasks.

  • f_regression
    F-value between label/feature for regression tasks.

  • mutual_info_regression
    Mutual information for a continuous target.

  • SelectPercentile
    Select features based on percentile of the highest scores.

  • SelectFpr
    Select features based on a false positive rate test.

  • SelectFdr
    Select features based on an estimated false discovery rate.

  • SelectFwe
    Select features based on family-wise error rate.

  • GenericUnivariateSelect
    Univariate feature selector with configurable mode.

Pearson correlation coefficient (Pearson Correlation)

The Pearson correlation coefficient is one of the simplest ways to understand the relationship between a feature and the response variable. It measures the linear correlation between variables, and the result is a value in the interval [-1, 1]: -1 indicates perfect negative correlation, +1 indicates perfect positive correlation, and 0 indicates no linear correlation.

import numpy as np
from scipy.stats import pearsonr
np.random.seed(0)
size = 300
x = np.random.normal(0, 1, size)
# pearsonr(x, y) takes the feature and target vectors as input and returns both the correlation coefficient and the p-value
print("Lower noise", pearsonr(x, x + np.random.normal(0, 1, size)))
print("Higher noise", pearsonr(x, x + np.random.normal(0, 10, size)))
Lower noise (0.7182483686213841, 7.32401731299835e-49)
Higher noise (0.057964292079338155, 0.3170099388532475)

In this example, we compare the variable before and after adding noise. When the noise is relatively small, the correlation is strong and the p-value is very low.

We mainly use the Pearson correlation coefficient to examine the correlation between features, rather than the correlation between features and the dependent variable.

Wrapper

Recursive feature elimination (Recursive Feature Elimination)

Recursive feature elimination uses a base model to run multiple rounds of training. After each round, the features with the smallest weight coefficients are removed, and the next round is trained on the new feature set.

Given a predictive model that assigns weights to features (for example, the coefficients of a linear model), RFE selects features by recursively shrinking the feature set under consideration. First, the model is trained on the original features and each feature is assigned a weight. Then the features whose absolute weights are smallest are removed from the feature set. This is repeated recursively until the number of remaining features reaches the required number.

RFECV runs RFE within cross-validation in order to select the optimal number of features: an external learning algorithm, such as an SVM, is specified; the cross-validated error is computed for each number of retained features, and the feature subset with the smallest error is selected. (Exhaustively evaluating all 2^d - 1 non-empty subsets of d features would be infeasible; RFECV only scores the nested subsets produced by recursive elimination.) A sketch of RFECV appears after the RFE example below.

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

rf = RandomForestClassifier()
iris=load_iris()
X,y=iris.data,iris.target
print(X.shape)
print(X[:5, :])
(150, 4)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
# Rank features with recursive feature elimination
rfe = RFE(estimator=rf, n_features_to_select=3)
X_rfe = rfe.fit_transform(X,y)
print(X_rfe.shape)
print(X_rfe[:5, :])
(150, 3)
[[5.1 1.4 0.2]
 [4.9 1.4 0.2]
 [4.7 1.3 0.2]
 [4.6 1.5 0.2]
 [5.  1.4 0.2]]



Embedded

Feature selection using SelectFromModel

SelectFromModel is a meta-transformer that selects features based on the importance weights (coefficients or feature importances) computed by an estimator: features whose importance falls below the given threshold are discarded.

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

clf = RandomForestClassifier()
iris=load_iris()
X,y=iris.data,iris.target
print(X.shape)
print(X[:5, :])
# Keep features whose importance (as estimated by the random forest) is at least 0.25
sfm = SelectFromModel(clf, threshold=0.25)
X_sfm = sfm.fit_transform(X,y)
print(X_sfm.shape)
print(X_sfm[:5, :])
(150, 4)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
(150, 2)
[[1.4 0.2]
 [1.4 0.2]
 [1.3 0.2]
 [1.5 0.2]
 [1.4 0.2]]



L1-based feature selection

Linear models penalized with the L1 norm produce sparse solutions: most of the feature coefficients are 0. When you want to reduce dimensionality before feeding the data to another classifier, you can use feature_selection.SelectFromModel to keep only the features whose coefficients are non-zero.

In particular, the sparse estimators commonly used for this purpose are linear_model.Lasso (regression), and linear_model.LogisticRegression and svm.LinearSVC (classification).

from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
print(X.shape)
(150, 4)
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_embed = model.transform(X)
X_embed.shape
(150, 3)
X_embed[:5,:]
array([[5.1, 3.5, 1.4],
       [4.9, 3. , 1.4],
       [4.7, 3.2, 1.3],
       [4.6, 3.1, 1.5],
       [5. , 3.6, 1.4]])
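The outline at the top also mentions using feature selection as part of a pipeline. A minimal sketch (reusing the iris X and y; the downstream classifier is an illustrative choice) chains the L1-based selector with a model so that selection happens inside the fitting process:

from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

clf = Pipeline([
    ('feature_selection', SelectFromModel(LinearSVC(C=0.01, penalty="l1", dual=False))),
    ('classification', RandomForestClassifier(n_estimators=100))
])
clf.fit(X, y)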

So which approach do we prefer in practice?

First, let's review the problems we typically run into with models in a business setting.

  • The model simply performs poorly
  • The model works on the training set but fails on the out-of-time test set
  • The out-of-time test results are fine, but performance drops after deployment
  • Performance is good right after deployment, but the score distribution starts to decline a few weeks later
  • The score distribution is stable for a month or two and then suddenly collapses
  • There is no obvious problem, but the model gradually degrades month by month

Next, let's think about what properties the variables used in a business model should have.

  • Variables must contribute to the model, i.e. they must be able to separate the customer population
  • Logistic regression requires the independent variables to be linearly independent of one another
  • Logistic regression scorecards also prefer variables with a monotone trend (partly for business reasons; from a purely modeling point of view, a monotone variable is not necessarily better than one with a turning point)
  • The customer population should be stably distributed on each variable; some distribution shift is inevitable, but it should not fluctuate too much

With this in mind, we pick from the methods above the ones that best fit the current usage scenario.

import pandas as pd
import numpy as np
df_train = pd.read_csv('train.csv')
df_train.head()
PassengerId label Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

1) Variable Importance

  • IV value
  • Chi-square test
  • Model-based screening

Here we mainly use the IV value, or sometimes model-based screening.

IV is essentially WOE multiplied by a weighting factor. For bin i, let $y_i$ and $n_i$ be the numbers of positive and negative samples in the bin, and $y_T$ and $n_T$ the totals:

  • $p_{y_i}=\frac{y_i}{y_T}$
  • $p_{n_i}=\frac{n_i}{n_T}$
  • $woe_i = ln(\frac{p_{y_i}}{p_{n_i}})$
  • $iv_i = (p_{y_i} - p_{n_i}) \times woe_i$

Finally, sum the $iv_i$ over all bins to obtain the total IV:
$$IV = \sum_i iv_i$$

import math
# Toy example: IV contribution of a single bin that holds
# 40% of the positive samples (a = p_y) and 60% of the negative samples (b = p_n)
a = 0.4
b = 0.6
iv = (a - b) * math.log(a / b)
iv
0.08109302162163284
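As a sketch of how the formulas above translate to code for a whole binned feature (assuming a DataFrame with a binary label column and an already binned feature column; the function name and the epsilon guard are my own):

import numpy as np
import pandas as pd

def calc_iv(df, feature, label, eps=1e-10):
    # WOE and IV of one binned feature against a binary label (1 = positive)
    grouped = df.groupby(feature)[label].agg(['sum', 'count'])
    bad = grouped['sum']                      # y_i: positives per bin
    good = grouped['count'] - grouped['sum']  # n_i: negatives per bin
    p_bad = bad / bad.sum() + eps             # p_{y_i}
    p_good = good / good.sum() + eps          # p_{n_i}
    woe = np.log(p_bad / p_good)
    iv = ((p_bad - p_good) * woe).sum()
    return woe, iv

# e.g. calc_iv(df_train.assign(fare_bin=pd.qcut(df_train['Fare'], 10)), 'fare_bin', 'label')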

Alternatively, use the feature importances output by an ensemble model:

# Feature importances from LightGBM (assumes `model` is a fitted LGBMClassifier)
feature = pd.DataFrame(
            {'name': model.booster_.feature_name(),
             'importance': model.feature_importances_
            }).sort_values(by=['importance'], ascending=False)

2) Collinearity

  • Correlation coefficient (COR)
  • Variance inflation factor (VIF)

When building models based on the idea of space partitioning, we must pay attention to the correlation between variables. To look at two variables in isolation, we use the Pearson correlation coefficient.

df_train.corr()
PassengerId label Pclass Age SibSp Parch Fare
PassengerId 1.000000 -0.005007 -0.035144 0.036847 -0.057527 -0.001652 0.012658
label -0.005007 1.000000 -0.338481 -0.077221 -0.035322 0.081629 0.257307
Pclass -0.035144 -0.338481 1.000000 -0.369226 0.083081 0.018443 -0.549500
Age 0.036847 -0.077221 -0.369226 1.000000 -0.308247 -0.189119 0.096067
SibSp -0.057527 -0.035322 0.083081 -0.308247 1.000000 0.414838 0.159651
Parch -0.001652 0.081629 0.018443 -0.189119 0.414838 1.000000 0.216225
Fare 0.012658 0.257307 -0.549500 0.096067 0.159651 0.216225 1.000000
import seaborn as sns
sns.set(color_codes=True)
np.random.seed(sum(map(ord, "distributions")))
# Plot pairwise relationships in the dataset
sns.pairplot(df_train)  # univariate distributions on the diagonal

In multiple regression, we can check whether the regression model suffers from serious multicollinearity by computing the variance inflation factor (VIF). Definition:

$$VIF_i = \frac{1}{1-R_i^2}$$

where $R_i$ is the coefficient of determination obtained by regressing the i-th independent variable on all the remaining independent variables. The variance inflation factor is the reciprocal of the tolerance $1-R_i^2$.

The larger the VIF, the greater the likelihood of collinearity between the independent variables. Generally speaking, if the variance inflation factor exceeds 10, the regression model has serious multicollinearity. According to the collinearity criterion of Hair (1995), when a variable's tolerance is greater than 0.1, i.e. its variance inflation factor is below 10, it is acceptable and there is no clear collinearity between the variables.

For detailed usage of the VIF function, see the official statsmodels documentation.

from statsmodels.stats.outliers_influence import variance_inflation_factor
import numpy as np

# Four variables observed on five samples; transpose so that columns are variables
data = [[1, 2, 3, 4, 5],
        [2, 4, 6, 8, 9],
        [1, 1, 1, 1, 1],
        [2, 4, 6, 4, 7]]
X = np.array(data).T

# VIF of the first variable (column 0)
variance_inflation_factor(X, 0)
98.33333333333381
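To scan every variable at once, a small follow-up sketch reusing the same X:

# VIF for each column of X; large values indicate collinearity with the other columns
vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print(vifs)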

3) Monotonicity

  • bivar chart
# Equal-frequency binning
df_train.loc[:,'fare_qcut'] = pd.qcut(df_train['Fare'], 10)
print(df_train.head())
df_train = df_train.sort_values('Fare')
alist = list(set(df_train['fare_qcut']))
badrate = {}
for x in alist:
    
    a = df_train[df_train.fare_qcut == x]
    
    bad = a[a.label == 1]['label'].count()
    good = a[a.label == 0]['label'].count()
    
    badrate[x] = bad/(bad+good)
badrate
     PassengerId  label  Pclass                             Name   Sex   Age  \
271          272      1       3     Tornquist, Mr. William Henry  male  25.0   
277          278      0       2      Parkes, Mr. Francis "Frank"  male   NaN   
263          264      0       1            Harrison, Mr. William  male  40.0   
597          598      0       3              Johnson, Mr. Alfred  male  49.0   
302          303      0       3  Johnson, Mr. William Cahoone Jr  male  19.0   

     SibSp  Parch  Ticket  Fare Cabin Embarked       fare_qcut  
271      0      0    LINE   0.0   NaN        S  (-0.001, 7.55]  
277      0      0  239853   0.0   NaN        S  (-0.001, 7.55]  
263      0      0  112059   0.0   B94        S  (-0.001, 7.55]  
597      0      0    LINE   0.0   NaN        S  (-0.001, 7.55]  
302      0      0    LINE   0.0   NaN        S  (-0.001, 7.55]  





{Interval(39.688, 77.958, closed='right'): 0.5280898876404494,
 Interval(14.454, 21.679, closed='right'): 0.42045454545454547,
 Interval(7.55, 7.854, closed='right'): 0.2988505747126437,
 Interval(8.05, 10.5, closed='right'): 0.23076923076923078,
 Interval(10.5, 14.454, closed='right'): 0.42857142857142855,
 Interval(77.958, 512.329, closed='right'): 0.7586206896551724,
 Interval(-0.001, 7.55, closed='right'): 0.14130434782608695,
 Interval(27.0, 39.688, closed='right'): 0.37362637362637363,
 Interval(7.854, 8.05, closed='right'): 0.1792452830188679,
 Interval(21.679, 27.0, closed='right'): 0.5168539325842697}
f = zip(badrate.keys(),badrate.values())
f = sorted(f,key = lambda x : x[1],reverse = True )
badrate = pd.DataFrame(f)
badrate
0 1
0 (77.958, 512.329] 0.758621
1 (39.688, 77.958] 0.528090
2 (21.679, 27.0] 0.516854
3 (10.5, 14.454] 0.428571
4 (14.454, 21.679] 0.420455
5 (27.0, 39.688] 0.373626
6 (7.55, 7.854] 0.298851
7 (8.05, 10.5] 0.230769
8 (7.854, 8.05] 0.179245
9 (-0.001, 7.55] 0.141304
badrate.columns = pd.Series(['cut','badrate'])
badrate = badrate.sort_values('cut')
print('===============================================')
print(badrate)
badrate.plot('cut','badrate')
===============================================
                 cut   badrate
9     (-0.001, 7.55]  0.141304
6      (7.55, 7.854]  0.298851
8      (7.854, 8.05]  0.179245
7       (8.05, 10.5]  0.230769
3     (10.5, 14.454]  0.428571
4   (14.454, 21.679]  0.420455
2     (21.679, 27.0]  0.516854
5     (27.0, 39.688]  0.373626
1   (39.688, 77.958]  0.528090
0  (77.958, 512.329]  0.758621





[Line plot of badrate against the fare bins, used to check the monotonic trend.]

4) Stability

  • PSI
  • Cross-time cross-validation

Cross-time cross-validation

Split the samples by month and, taking each split in turn as the training set and test set, train the model; then take the intersection of the variables that enter the model each time. But be careful with collinear features! (A sketch of the procedure appears after the list of workarounds below.)

Workarounds

  • A variable does not need to enter the model in every split; appearing in most of them is enough
  • Remove collinearity first (this is also why we remove collinear features even for ensemble models)
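A minimal sketch of the idea, assuming df_train had a hypothetical 'month' column and a list candidate_features of numeric feature columns (neither exists in the Titanic example above):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Hypothetical: 'month' column and candidate_features list are assumed to exist
selected_per_month = []
for month, part in df_train.groupby('month'):
    X_m, y_m = part[candidate_features], part['label']
    sfm = SelectFromModel(RandomForestClassifier(n_estimators=100)).fit(X_m, y_m)
    selected_per_month.append(set(np.array(candidate_features)[sfm.get_support()]))

# Keep the variables selected in every month (or in most months, per the first workaround)
stable_features = set.intersection(*selected_per_month)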

Population stability index (PSI)

Formula:

$$PSI = \sum{(\text{actual\%} - \text{expected\%}) \times \ln\left(\frac{\text{actual\%}}{\text{expected\%}}\right)}$$

An example from Zhihu:
Suppose you train a logistic regression model that outputs a probability p at prediction time.
Call the output on your test set p1; sort it from small to large and split it into 10 bins, e.g. 0-0.1, 0.1-0.2, and so on.
Now use the model to score new samples; call the result p2, and bin it using the same intervals as p1.
The actual proportion is the share of users falling into each bin of p2, and the expected proportion is the share of users in each bin of p1.
The idea is that if the model is stable, the user shares in each bin of p1 and p2 should be similar, i.e. the predicted probabilities should not shift much.
A PSI below 0.1 is generally considered to indicate high model stability, 0.1-0.25 moderate stability, and above 0.25 poor stability, in which case rebuilding the model is recommended.

def var_PSI(dev_data, val_data):
    # dev_data / val_data: per-bin counts for the expected (development) and actual (validation) samples
    dev_cnt, val_cnt = sum(dev_data), sum(val_data)
    if dev_cnt * val_cnt == 0:
        return None
    PSI = 0
    for i in range(len(dev_data)):
        # add a small epsilon to both ratios so empty bins do not cause log(0) or division by zero
        dev_ratio = dev_data[i] / dev_cnt + 1e-10
        val_ratio = val_data[i] / val_cnt + 1e-10
        psi = (dev_ratio - val_ratio) * math.log(dev_ratio / val_ratio)
        PSI += psi
    return PSI

Note that the number of bins affects a variable's PSI value.

PSI can be computed not only for the model score but also for individual variables, in exactly the same way: simply compute the PSI on the data binned separately for each time period.
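As a quick usage sketch, with hypothetical per-bin counts for the expected (development) and actual (validation) samples and the var_PSI function above:

dev_counts = [100, 110, 95, 105, 98, 102, 97, 103, 99, 91]  # hypothetical expected bin counts
val_counts = [90, 120, 80, 115, 95, 110, 85, 112, 96, 97]   # hypothetical actual bin counts
print(var_PSI(dev_counts, val_counts))  # a small value here would indicate a stable distribution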
