Feature Engineering: Several Common Feature Selection Methods with Scikit-learn

Feature selection methods fall into three categories: Filter, Wrapper, and Embedded methods.

Filter methods

Feature selection method 1: Removing features with low variance

This method is simple but of limited use on its own; it serves as a pre-selection step that removes features with very low variance.
If machine resources are sufficient and you want to preserve as much information as possible, set the threshold relatively high, or only filter out discrete features that take a single value.
Discrete variables: if, for example, 95% of the instances of a feature take the value 1, that feature contributes little; if 100% take the value 1, the feature carries no information at all.
Continuous variables: discard features whose variance falls below a chosen threshold.

from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
# .8 * (1 - .8) is the variance of a Bernoulli feature with p = 0.8, so this drops
# boolean features that take the same value in more than 80% of the samples
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
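
To see which columns survive the threshold, the fitted selector exposes get_support() and variances_; a small usage sketch on the same toy X:

print(sel.get_support())               # boolean mask of retained columns: [False  True  True]
print(sel.get_support(indices=True))   # indices of the retained columns
print(sel.variances_)                  # per-feature variances computed during fit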

 

Feature selection method 2: Univariate feature selection

Univariate feature selection measures the relationship between each feature and the response variable independently: a statistical test is applied to each feature, and features are kept or discarded based on their scores.
The method is simple, easy to run, and easy to understand, and it generally works well for gaining insight into the data (although it evaluates features in isolation, so it may not improve a model's generalization ability).
Univariate feature selection can be used to understand the data and its structure and to exclude irrelevant features, but it cannot detect redundant features. A general sketch using scikit-learn's built-in API is shown below, followed by the individual scoring methods.
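
As a general illustration (not from the original post), scikit-learn packages this univariate workflow in SelectKBest / SelectPercentile together with scoring functions such as f_regression or f_classif; a minimal sketch on the Boston housing data used throughout this post:

from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectKBest, f_regression

boston = load_boston()
X, y, names = boston["data"], boston["target"], boston["feature_names"]

# score every feature against the target with a univariate F-test and keep the best 5
selector = SelectKBest(score_func=f_regression, k=5)
X_new = selector.fit_transform(X, y)
print(names[selector.get_support()])   # names of the 5 selected features
print(selector.scores_.round(1))       # one univariate F-score per feature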

  1. Pearson correlation coefficient (Pearson Correlation) - mainly used for screening continuous features; it is only sensitive to linear relationships.

import numpy as np
from scipy.stats import pearsonr

np.random.seed(2019)
size = 1000
X = np.random.normal(0, 1, size)
# compute the correlation coefficient (and p-value) between two variables
print("Lower noise {}".format(pearsonr(X, X + np.random.normal(0, 1, size))))
print("Higher noise {}".format(pearsonr(X, X + np.random.normal(0, 10, size))))

  2. Mutual information and maximal information coefficient (Mutual information and maximal information coefficient)

Mutual information is defined as the difference between the entropy H(Y) and the conditional entropy H(Y|X), i.e. I(X; Y) = H(Y) - H(Y|X). It is only suitable for selecting discrete features and is very sensitive to how continuous variables are discretized.
Because mutual information is awkward to use directly for feature selection, the maximal information coefficient (MIC) was introduced: it first searches for an optimal discretization and then converts the mutual information into a normalized metric that takes values in the interval [0, 1].

x = np.random.normal(0, 10, 300)
z = x * x
pearsonr(x, z)
# the Pearson correlation is close to 0 (about -0.1): it misses the quadratic relationship
from minepy import MINE
m = MINE()
m.compute_score(x, z)
print(m.mic())
# the MIC is close to 1, capturing the strong nonlinear dependence
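
If minepy is not installed, scikit-learn's mutual_info_regression offers a related screening score (it estimates mutual information with a nearest-neighbour method rather than MIC); a small sketch reusing the x and z arrays from above:

from sklearn.feature_selection import mutual_info_regression

# mutual_info_regression expects a 2-D feature matrix, hence the reshape
mi = mutual_info_regression(x.reshape(-1, 1), z, random_state=0)
print(mi)   # clearly above zero, unlike the near-zero Pearson correlation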

  3. Distance correlation (Distance Correlation) - designed to overcome a weakness of the Pearson correlation coefficient: a Pearson correlation of 0 does not imply that two variables are independent (they may be nonlinearly related), whereas a distance correlation of 0 does imply independence.

from scipy.spatial.distance import pdist, squareform
import numpy as np

def distcorr(X, Y):
    """ Compute the distance correlation function

    >>> a = [1,2,3,4,5]
    >>> b = np.array([1,2,9,4,4])
    >>> distcorr(a, b)
    0.762676242417
    """
    X = np.atleast_1d(X)
    Y = np.atleast_1d(Y)
    if np.prod(X.shape) == len(X):
        X = X[:, None]
    if np.prod(Y.shape) == len(Y):
        Y = Y[:, None]
    X = np.atleast_2d(X)
    Y = np.atleast_2d(Y)
    n = X.shape[0]
    if Y.shape[0] != X.shape[0]:
        raise ValueError('Number of samples must match')
    a = squareform(pdist(X))
    b = squareform(pdist(Y))
    A = a - a.mean(axis=0)[None, :] - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0)[None, :] - b.mean(axis=1)[:, None] + b.mean()

    dcov2_xy = (A * B).sum()/float(n * n)
    dcov2_xx = (A * A).sum()/float(n * n)
    dcov2_yy = (B * B).sum()/float(n * n)
    dcor = np.sqrt(dcov2_xy)/np.sqrt(np.sqrt(dcov2_xx) * np.sqrt(dcov2_yy))
    return dcor
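
As a quick usage check (reusing the x and z arrays from the MIC example above), distance correlation also picks up the quadratic dependence that the Pearson coefficient misses:

print(distcorr(x, z))                              # well above zero despite the near-zero Pearson correlation
print(distcorr(x, np.random.normal(0, 1, 300)))    # much smaller for an unrelated noise variable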

  4. Model based ranking

The idea here is to directly use the machine learning algorithm you intend to deploy, building a predictive model for each individual feature against the response variable. If the relationship between a feature and the response is nonlinear, there are many suitable options, such as tree-based methods (decision trees, random forests) or linear models with basis expansions. Tree-based methods are among the easiest to apply because they can model nonlinear relationships and require little tuning.
The main thing to guard against is overfitting, so the tree depth should be kept relatively small and cross-validation should be applied.

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
import numpy as np
# from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# load the Boston housing dataset as an example
boston = load_boston()
# print(boston.DESCR)
# the train/test split below is used by the model-comparison snippets further down
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=33, test_size=0.25)
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]

rf = RandomForestRegressor(n_estimators=20, max_depth=4)

scores = []
# train a model on each feature individually and use its cross-validated R^2 score for feature selection
for i in range(X.shape[1]):
    score = cross_val_score(rf, X[:, i:i + 1], Y, scoring="r2",
                            cv=ShuffleSplit(n_splits=3, test_size=.3))
    scores.append((round(np.mean(score), 3), names[i]))
print(sorted(scores, reverse=True))


from sklearn.svm import SVR
l_svr = SVR(kernel='linear')  # other kernels: 'poly', 'rbf'
l_svr.fit(x_train, y_train)
l_svr.score(x_test, y_test)

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(weights="uniform")
knn.fit(x_train, y_train)
knn.score(x_test, y_test)

from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor()
dt.fit(x_train,y_train)
dt.score(x_test,y_test)

from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(x_train,y_train)
rfr.score(x_test,y_test)

from sklearn.ensemble import ExtraTreesRegressor
etr = ExtraTreesRegressor()
etr.fit(x_train,y_train)
etr.score(x_test,y_test)

from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor()
gbr.fit(x_train,y_train)
gbr.score(x_test,y_test)

  5. Chi-square test - only for selecting discrete features in classification problems

The chi-square statistic describes either the independence of two events or the degree of deviation between observed and expected values. The larger the chi-square value, the larger the deviation between the observed and expected values, and the weaker the independence of the two events.

# import SelectKBest and chi2 from sklearn
from sklearn.feature_selection import SelectKBest, chi2
# select the 5 most relevant features
X_chi2 = SelectKBest(chi2, k=5).fit_transform(X, Y)
X_chi2.shape
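
Because chi2 assumes non-negative feature values and a classification target, applying it to the Boston regression data above is for illustration only; a cleaner sketch uses a classification dataset such as iris:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X_iris, y_iris = iris.data, iris.target    # all four measurements are non-negative
# keep the 2 features with the highest chi-square statistic with respect to the class label
X_kbest = SelectKBest(chi2, k=2).fit_transform(X_iris, y_iris)
print(X_kbest.shape)    # (150, 2)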

  

Feature selection method 3: Linear models with regularization

When all features are on the same scale, the most important features should have the largest coefficients in the model, while features uncorrelated with the output should have coefficients close to zero. Even with a simple linear regression model, this works well when the data is not very noisy (or when there is a lot of data relative to the number of features) and the features are (relatively) independent.
Regularization adds extra constraints or penalty terms to the existing model (its loss function) in order to prevent overfitting and improve generalization.
For linear models, L1 and L2 regularization are also known as Lasso and Ridge, respectively.
Lasso can pick out a few high-quality features while driving the coefficients of the others to 0. This is useful when the number of features must be reduced, but less so for understanding the data.
Ridge regression spreads the coefficients more evenly across correlated variables, so L2 regularization is more useful for feature understanding.
Both evolve from multiple linear regression, which solves for the parameter vector θ. Each feature gets a weight coefficient coef: the sign of the coefficient indicates whether the feature is positively or negatively correlated with the target, and its absolute value reflects the feature's importance.

# load the Boston data
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
boston = load_boston()
X = boston.data
y = boston.target
# filter out outliers
X = X[y < 50]
y = y[y < 50]
reg = LinearRegression()
reg.fit(X, y)
# sort the coefficients
coefSort = reg.coef_.argsort()
# featureNameSort: feature names ordered by their effect on the target, from smallest to largest coefficient
# featureCoefSore: the corresponding coef_ values, from smallest to largest
featureNameSort = boston.feature_names[coefSort]
featureCoefSore = reg.coef_[coefSort]
print("featureNameSort:", featureNameSort)
print("featureCoefSore:", featureCoefSore)

# A helper method for pretty-printing linear models
def pretty_print_linear(coefs, names=None, sort=False):
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst,  key = lambda x:-np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name)
                                   for coef, name in lst)

# Lasso regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston

boston = load_boston()
scaler = StandardScaler()
X = scaler.fit_transform(boston["data"])
Y = boston["target"]
names = boston["feature_names"]

lasso = Lasso(alpha=.3)
lasso.fit(X, Y)

print("Lasso model: {}".format(
      pretty_print_linear(lasso.coef_, names, sort = True)))

# Ridge regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
size = 100

#We run the method 10 times with different random seeds
for i in range(10):
    print("Random seed {}".format(i))
    np.random.seed(seed=i)
    X_seed = np.random.normal(0, 1, size)
    X1 = X_seed + np.random.normal(0, .1, size)
    X2 = X_seed + np.random.normal(0, .1, size)
    X3 = X_seed + np.random.normal(0, .1, size)
    Y = X1 + X2 + X3 + np.random.normal(0, 1, size)
    X = np.array([X1, X2, X3]).T


    lr = LinearRegression()
    lr.fit(X,Y)
    print("Linear model: {}".format(pretty_print_linear(lr.coef_)))

    ridge = Ridge(alpha=10)
    ridge.fit(X,Y)
    print("Ridge model: {}".format(pretty_print_linear(ridge.coef_)))

  

Feature selection method 4: Random forest feature selection

  1. Mean decrease impurity
When training a decision tree, it is possible to compute how much each feature decreases the impurity of the tree. For a forest, the impurity decrease attributed to each feature can be averaged across trees, and this average decrease in impurity is used as the criterion for feature selection.

from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# load the Boston housing dataset as an example
boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]
# train a random forest and read each feature's importance from the feature_importances_ attribute
rf = RandomForestRegressor()
rf.fit(X, Y)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names),
             reverse=True))

  2. Mean decrease accuracy

Feature selection is performed by directly measuring the impact of each feature on the model's accuracy.
The main idea is to permute (shuffle) the values of each feature and measure how much the permutation degrades the model's accuracy.
For unimportant variables, permutation has little effect on accuracy.
For important variables, permutation noticeably reduces accuracy.

from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import r2_score
from collections import defaultdict

X = boston["data"]
Y = boston["target"]
rf = RandomForestRegressor()
scores = defaultdict(list)
# cross-validate the scores on a number of different random splits of the data
for train_idx, test_idx in ShuffleSplit(n_splits=100, test_size=.3).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    # train the model on the unmodified features; this accuracy is the baseline against
    # which the shuffled-feature accuracies are compared
    r = rf.fit(X_train, Y_train)
    acc = r2_score(Y_test, rf.predict(X_test))
    # iterate over each feature column
    for i in range(X.shape[1]):
        X_t = X_test.copy()
        # shuffle this column, permuting the feature's values across the test samples
        np.random.shuffle(X_t[:, i])
        shuff_acc = r2_score(Y_test, rf.predict(X_t))
        # record the relative drop in accuracy after shuffling this feature
        scores[names[i]].append((acc - shuff_acc) / acc)
print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat) for feat, score in scores.items()], reverse=True))
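
Recent scikit-learn releases (0.22 and later) ship this idea as permutation_importance in sklearn.inspection, which replaces the manual shuffling loop; a minimal sketch, assuming the rf, X, Y and names variables defined above:

from sklearn.inspection import permutation_importance

rf.fit(X, Y)
result = permutation_importance(rf, X, Y, scoring="r2", n_repeats=10, random_state=0)
# importances_mean holds the average drop in R^2 caused by shuffling each feature
print(sorted(zip(result.importances_mean.round(4), names), reverse=True))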

  

Feature selection method 5: Top feature selection

  1. Stability selection (Stability selection)
The main idea is to run a feature selection algorithm on different subsets of the data and different subsets of the features, repeating this many times and aggregating the results, for example by counting how often each feature was deemed important (the number of times it was selected as important divided by the number of subsets it was tested on).
Ideally, the score of a truly important feature will be close to 100%, slightly weaker features will get nonzero scores, and the least useful features will score close to 0.

# note: RandomizedLasso is only available in older scikit-learn releases (it was removed in 0.21)
from sklearn.linear_model import RandomizedLasso
from sklearn.datasets import load_boston
boston = load_boston()
#using the Boston housing data.
#Data gets scaled automatically by sklearn's implementation
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]
rlasso = RandomizedLasso(alpha=0.025)
rlasso.fit(X, Y)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rlasso.scores_), names),
             reverse=True))
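
Since RandomizedLasso is gone from recent scikit-learn releases, here is a simplified hand-rolled sketch of the same idea: fit a Lasso on many random subsamples of the data and count how often each feature keeps a nonzero coefficient. The subsample fraction (75%), the number of runs and the alpha value are illustrative choices, not taken from the original post:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

n_runs, frac = 200, 0.75
counts = np.zeros(X.shape[1])
rng = np.random.RandomState(0)
Xs = StandardScaler().fit_transform(X)   # Lasso is scale-sensitive, so standardize first

for _ in range(n_runs):
    # draw a random subsample of the rows
    idx = rng.choice(len(Xs), size=int(frac * len(Xs)), replace=False)
    lasso = Lasso(alpha=0.3).fit(Xs[idx], Y[idx])
    counts += (lasso.coef_ != 0)         # record which features survived this run

stability = counts / n_runs              # selection frequency per feature
print(sorted(zip(stability.round(2), names), reverse=True))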

  2. Recursive feature elimination (Recursive feature elimination, RFE)

The main idea of recursive feature elimination is to build a model (such as an SVM or a regression model) repeatedly, each time selecting the best (or worst) feature (for example by its coefficient), setting that feature aside, and repeating the process on the remaining features until all features have been covered.
The order in which features are eliminated gives the feature ranking, so RFE is a greedy algorithm for finding an optimal feature subset.
The stability of RFE depends largely on which underlying model is used during the iterations.
If RFE uses ordinary, unregularized regression, which is unstable, then RFE is unstable.
If RFE uses Ridge regression, which is stabilized by regularization, then RFE is stable.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]
#use linear regression as the model
lr = LinearRegression()
#rank all features, i.e continue the elimination until the last one
rfe = RFE(lr, n_features_to_select=1)
rfe.fit(X,Y)
print("Features sorted by their rank:")
print(sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names)))
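
To let cross-validation choose how many features to keep, instead of ranking all of them down to one, scikit-learn also provides RFECV; a small sketch reusing the lr, X, Y and names variables from above:

from sklearn.feature_selection import RFECV

# cross-validated RFE: keeps the feature count that maximizes the CV score
rfecv = RFECV(lr, step=1, cv=5, scoring="r2")
rfecv.fit(X, Y)
print("Optimal number of features:", rfecv.n_features_)
print("Selected features:", names[rfecv.support_])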

  Regularized linear models can be used both for feature selection and for feature understanding. Compared with L1 regularization, L2 regularization behaves more stably and is better suited to understanding the data.

  Since the relationship between the response variable and the features is often nonlinear, basis expansion methods can be used to transform the features into a more suitable space, and a simple linear model can then be applied to the expanded features.

  Random forests are a very popular and easy-to-use feature selection method, but they have two major problems:

  • Important features may receive low scores when features are correlated (the correlated-feature problem)
  • The method is biased toward categorical variables with many categories (the bias problem)

  When selecting features to improve model performance, cross-validation can be used to verify whether one method really is better than another.
  When using feature selection to understand the data, pay attention to the stability of the feature selection method: a method with poor stability can easily lead to wrong conclusions.

 

References:

Several common feature selection methods with Scikit-learn

Feature engineering series: feature selection principles and implementation (part 1)

Feature engineering series: feature selection principles and implementation (part 2)

 
