Feature selection methods fall into three families: Filter, Wrapper, and Embedded.
Filter methods
Feature selection method 1: removing features with low variance
This method is simple but of limited use on its own. It works best as a preprocessing step that removes features whose values barely vary.
If machine resources allow and you want to preserve as much information as possible, raise the bar for removal (for example, require a higher share of identical values before a feature is dropped), or filter out only those discrete features that take a single value.
Discrete variables: if a feature takes the value 1 in 95% of the samples, it contributes little. If it takes the value 1 in 100% of the samples, it carries no information.
Continuous variables: discard features whose variance falls below a chosen threshold.
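    from sklearn.feature_selection import VarianceThreshold

    X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
    # A Boolean feature with share p of ones has variance p * (1 - p), so this
    # threshold drops features that hold the same value in more than 80% of samples.
    sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
    sel.fit_transform(X)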
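The selection rule can also be traced by hand. The following is a minimal plain-Python sketch, assuming the population variance and a strict "greater than" comparison against the threshold; the helper names `column_variances` and `keep_columns` are made up for illustration and are not part of scikit-learn.

```python
def column_variances(rows):
    """Population variance of every column in a list-of-rows matrix."""
    n = len(rows)
    variances = []
    for column in zip(*rows):
        mean = sum(column) / n
        variances.append(sum((v - mean) ** 2 for v in column) / n)
    return variances


def keep_columns(rows, threshold):
    """Indices of the columns whose variance is strictly above the threshold."""
    return [i for i, var in enumerate(column_variances(rows)) if var > threshold]


X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
# A Boolean feature with share p of ones has variance p * (1 - p).
threshold = 0.8 * (1 - 0.8)
kept = keep_columns(X, threshold)
print(kept)
```

The first column is 0 in five of the six samples, so its variance (5/36, about 0.139) falls below the threshold of 0.16 and the column is dropped; the other two columns are kept.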
Feature selection method 2: univariate feature selection
Univariate feature selection measures the relationship between each feature and the response variable independently. A univariate test is run on each feature, and features with poor scores are discarded. The method is simple, easy to run, and easy to interpret, and it usually works well for understanding data (though it is not necessarily effective for optimizing the feature set to improve generalization).
Univariate feature selection can be used to understand the data and its structure. It can also exclude irrelevant features, but it cannot detect redundant features.
1. Pearson correlation coefficient (Pearson Correlation): mainly used to screen continuous features; it is sensitive only to linear relationships.
    import numpy as np
    from scipy.stats import pearsonr

    np.random.seed(2019)
    size = 1000
    X = np.random.normal(0, 1, size)
    # Compute the correlation coefficient (and p-value) between two variables
    print("Lower noise {}".format(pearsonr(X, X + np.random.normal(0, 1, size))))
    print("Higher noise {}".format(pearsonr(X, X + np.random.normal(0, 10, size))))
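To see why the coefficient is sensitive only to linear relationships, it helps to compute it by hand. The sketch below is a plain-Python version of the standard formula (the function name `pearson` and the toy data are chosen for illustration); it compares an exact linear dependence with an exact quadratic one.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]   # exact linear dependence on xs
quadratic = [x * x for x in xs]    # exact but nonlinear dependence on xs
r_linear = pearson(xs, linear)
r_quadratic = pearson(xs, quadratic)
print(r_linear, r_quadratic)
```

The quadratic series is fully determined by `xs`, yet its coefficient is zero because the positive and negative halves cancel in the covariance.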
2. Mutual information and maximal information coefficient (MIC)
Mutual information is the difference between the entropy H(Y) and the conditional entropy H(Y | X). It applies directly only to discrete features and is very sensitive to how continuous values are discretized.
Because mutual information is inconvenient to use directly for feature selection, the maximal information coefficient was introduced. MIC first searches for an optimal discretization and then converts the mutual information into a normalized measure with values in the interval [0, 1].
    import numpy as np
    from scipy.stats import pearsonr

    x = np.random.normal(0, 10, 300)
    z = x * x
    pearsonr(x, z)  # correlation of about -0.1 in the original run

    from minepy import MINE
    m = MINE()
    m.compute_score(x, z)
    print(m.mic())
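The definition of mutual information as H(Y) minus H(Y | X) can be checked on small discrete samples. This is a minimal sketch using empirical frequencies and base-2 logarithms (so the result is in bits); the function names and the toy sequences are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical entropy H(V) in bits."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(y, x):
    """Empirical conditional entropy H(Y | X) in bits."""
    n = len(x)
    total = 0.0
    for x_value, count in Counter(x).items():
        y_given_x = [yi for xi, yi in zip(x, y) if xi == x_value]
        total += (count / n) * entropy(y_given_x)
    return total


def mutual_information(x, y):
    """I(X; Y) = H(Y) - H(Y | X)."""
    return entropy(y) - conditional_entropy(y, x)


x = [0, 0, 1, 1]
mi_dependent = mutual_information(x, [0, 0, 1, 1])    # Y is a copy of X
mi_independent = mutual_information(x, [0, 1, 0, 1])  # Y is unrelated to X
print(mi_dependent, mi_independent)
```

When Y copies X, knowing X removes all uncertainty about Y and the mutual information equals H(Y), one bit here; when Y is independent of X, the conditional entropy equals H(Y) and the mutual information is zero.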
3. Distance correlation coefficient (Distance Correlation): designed to overcome the weaknesses of the Pearson correlation coefficient. Unlike Pearson, a distance correlation of 0 means the two variables are independent.
    from scipy.spatial.distance import pdist, squareform
    import numpy as np


    def distcorr(X, Y):
        """Compute the distance correlation function

        >>> a = [1,2,3,4,5]
        >>> b = np.array([1,2,9,4,4])
        >>> distcorr(a, b)
        0.762676242417
        """
        X = np.atleast_1d(X)
        Y = np.atleast_1d(Y)
        # Turn 1-D inputs into column vectors
        if np.prod(X.shape) == len(X):
            X = X[:, None]
        if np.prod(Y.shape) == len(Y):
            Y = Y[:, None]
        X = np.atleast_2d(X)
        Y = np.atleast_2d(Y)
        n = X.shape[0]
        if Y.shape[0] != X.shape[0]:
            raise ValueError('Number of samples must match')
        # Pairwise distance matrices
        a = squareform(pdist(X))
        b = squareform(pdist(Y))
        # Double-center each distance matrix
        A = a - a.mean(axis=0)[None, :] - a.mean(axis=1)[:, None] + a.mean()
        B = b - b.mean(axis=0)[None, :] - b.mean(axis=1)[:, None] + b.mean()
        dcov2_xy = (A * B).sum() / float(n * n)
        dcov2_xx = (A * A).sum() / float(n * n)
        dcov2_yy = (B * B).sum() / float(n * n)
        dcor = np.sqrt(dcov2_xy) / np.sqrt(np.sqrt(dcov2_xx) * np.sqrt(dcov2_yy))
        return dcor
4. Model-based ranking (Model based ranking)
The idea is to take the machine learning algorithm you intend to use and build a predictive model of the response variable from each feature individually. If the relationship between a feature and the response is nonlinear, there are several options, such as tree-based methods (decision trees, random forests) or linear models with basis expansion. Tree-based methods are among the easiest to use because they model nonlinear relationships well and need little tuning.
The main concern is overfitting, so the trees should be kept relatively shallow and cross-validation should be applied.
    import numpy as np
    # load_boston was removed in scikit-learn 1.2; this code needs an older release.
    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestRegressor
    # from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

    # Load the Boston housing dataset as an example
    boston = load_boston()
    # print(boston.DESCR)
    x_train, x_test, y_train, y_test = train_test_split(
        boston.data, boston.target, random_state=33, test_size=0.25)
    X = boston["data"]
    Y = boston["target"]
    names = boston["feature_names"]

    rf = RandomForestRegressor(n_estimators=20, max_depth=4)
    scores = []
    # Train a model on each feature individually and use its
    # cross-validated score as the basis for feature selection.
    for i in range(X.shape[1]):
        score = cross_val_score(rf, X[:, i:i + 1], Y, scoring="r2",
                                cv=ShuffleSplit(n_splits=3, test_size=0.3))
        scores.append((round(np.mean(score), 3), names[i]))
    print(sorted(scores, reverse=True))

    from sklearn.svm import SVR
    l_svr = SVR(kernel='linear')  # other kernels: poly, rbf
    l_svr.fit(x_train, y_train)
    l_svr.score(x_test, y_test)

    from sklearn.neighbors import KNeighborsRegressor
    knn = KNeighborsRegressor(weights="uniform")
    knn.fit(x_train, y_train)
    knn.score(x_test, y_test)

    from sklearn.tree import DecisionTreeRegressor
    dt = DecisionTreeRegressor()
    dt.fit(x_train, y_train)
    dt.score(x_test, y_test)

    from sklearn.ensemble import RandomForestRegressor
    rfr = RandomForestRegressor()
    rfr.fit(x_train, y_train)
    rfr.score(x_test, y_test)

    from sklearn.ensemble import ExtraTreesRegressor
    etr = ExtraTreesRegressor()
    etr.fit(x_train, y_train)
    etr.score(x_test, y_test)

    from sklearn.ensemble import GradientBoostingRegressor
    gbr = GradientBoostingRegressor()
    gbr.fit(x_train, y_train)
    gbr.score(x_test, y_test)
5. Chi-square test: only for selecting discrete features in classification problems
The chi-square statistic describes the independence of two events, or equivalently how far the observed values deviate from the expected values. The larger the chi-square value, the larger the deviation between observed and expected counts, and the weaker the case that the two events are independent.
    # Import SelectKBest and chi2 from sklearn
    from sklearn.feature_selection import SelectKBest, chi2

    # Select the 5 most relevant features.
    # chi2 requires non-negative feature values in X and class labels in Y.
    X_chi2 = SelectKBest(chi2, k=5).fit_transform(X, Y)
    X_chi2.shape
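The relationship between observed counts, expected counts, and the statistic can be made concrete with a small contingency table. The sketch below computes the textbook chi-square statistic in plain Python; it is not a reimplementation of scikit-learn's `chi2` scorer, and the two tables are invented for illustration.

```python
def chi_square(table):
    """Chi-square statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Rows: feature value 0 / 1; columns: class 0 / 1 (hypothetical counts).
dependent = [[20, 30], [30, 20]]
independent = [[10, 20], [20, 40]]
chi_dependent = chi_square(dependent)
chi_independent = chi_square(independent)
print(chi_dependent, chi_independent)
```

In the second table every cell already equals its expected count, so the statistic is zero; in the first table each cell is off by 5 from the expected 25, which gives a statistic of 4.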
Feature selection method 3: linear models with regularization
When all features are on the same scale, the most important features should have the largest coefficients in the model, while features unrelated to the output should have coefficients close to zero. Even a simple linear regression works well this way when the data is not very noisy (or there is much more data than features) and the features are (relatively) independent.
Regularization adds an extra constraint or penalty term to the existing model (its loss function) to prevent overfitting and improve generalization.
For linear models, L1 and L2 regularization are also known as Lasso and Ridge.
Lasso selects a few strong features and drives the coefficients of the others toward 0. This is useful when the number of features must be reduced, but less so for understanding the data.
Ridge regression spreads the coefficients evenly across correlated variables, so L2 regularization is more useful for understanding features.
Multiple linear regression solves for the parameter vector θ. Each feature has a corresponding weight coefficient coef: its sign indicates whether the feature is positively or negatively correlated with the target, and its absolute value represents the importance of the feature.
    # Load the Boston data (load_boston requires scikit-learn < 1.2)
    import numpy as np
    from sklearn.datasets import load_boston
    from sklearn.linear_model import LinearRegression

    boston = load_boston()
    X = boston.data
    Y = boston.target
    # Filter out outliers
    X = X[Y < 50]
    Y = Y[Y < 50]
    reg = LinearRegression()
    reg.fit(X, Y)
    # Indices that sort the coefficients in ascending order
    coefSort = reg.coef_.argsort()
    # featureNameSort: feature names ordered by their effect on the target, smallest to largest
    # featureCoefSore: the coef_ values in the same order
    featureNameSort = boston.feature_names[coefSort]
    featureCoefSore = reg.coef_[coefSort]
    print("featureNameSort:", featureNameSort)
    print("featureCoefSore:", featureCoefSore)


    # A helper method for pretty-printing linear models
    def pretty_print_linear(coefs, names=None, sort=False):
        if names is None:
            names = ["X%s" % x for x in range(len(coefs))]
        lst = zip(coefs, names)
        if sort:
            lst = sorted(lst, key=lambda x: -np.abs(x[0]))
        return " + ".join("%s * %s" % (round(coef, 3), name) for coef, name in lst)


    # Lasso regression
    from sklearn.linear_model import Lasso
    from sklearn.preprocessing import StandardScaler
    from sklearn.datasets import load_boston

    boston = load_boston()
    scaler = StandardScaler()
    X = scaler.fit_transform(boston["data"])
    Y = boston["target"]
    names = boston["feature_names"]
    lasso = Lasso(alpha=.3)
    lasso.fit(X, Y)
    print("Lasso model: {}".format(pretty_print_linear(lasso.coef_, names, sort=True)))

    # Ridge regression
    from sklearn.linear_model import Ridge
    from sklearn.metrics import r2_score

    size = 100
    # We run the method 10 times with different random seeds
    for i in range(10):
        print("Random seed {}".format(i))
        np.random.seed(seed=i)
        X_seed = np.random.normal(0, 1, size)
        # Three strongly correlated features built from the same seed signal
        X1 = X_seed + np.random.normal(0, .1, size)
        X2 = X_seed + np.random.normal(0, .1, size)
        X3 = X_seed + np.random.normal(0, .1, size)
        Y = X1 + X2 + X3 + np.random.normal(0, 1, size)
        X = np.array([X1, X2, X3]).T
        lr = LinearRegression()
        lr.fit(X, Y)
        print("Linear model: {}".format(pretty_print_linear(lr.coef_)))
        ridge = Ridge(alpha=10)
        ridge.fit(X, Y)
        print("Ridge model: {}".format(pretty_print_linear(ridge.coef_)))
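The different behavior of the two penalties, where Lasso zeroes out weak coefficients and Ridge shrinks all of them, can be illustrated in closed form. The sketch below assumes the textbook simplification of an orthonormal design matrix and one common scaling of the penalty terms; neither assumption comes from the text above, and the starting coefficients are hypothetical.

```python
def lasso_shrink(coef, lam):
    """Soft-thresholding: the L1-penalized value of one coefficient."""
    magnitude = abs(coef) - lam
    if magnitude <= 0:
        return 0.0  # coefficients smaller than the penalty are set to zero
    return magnitude if coef > 0 else -magnitude


def ridge_shrink(coef, lam):
    """Proportional shrinkage: the L2-penalized value of one coefficient."""
    return coef / (1 + lam)


ols = [3.0, 0.5, -0.2]  # hypothetical unregularized coefficients
lam = 1.0
lasso = [lasso_shrink(c, lam) for c in ols]
ridge = [ridge_shrink(c, lam) for c in ols]
print(lasso)
print(ridge)
```

Under these assumptions the L1 penalty removes the two weak coefficients entirely, while the L2 penalty halves every coefficient and keeps their relative order, which matches the qualitative description of Lasso and Ridge above.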
Feature selection method 4: random forest selection
1. Mean decrease impurity
When a decision tree is trained, the amount by which each feature reduces the tree's impurity can be computed. For a forest, the impurity reduction of each feature is averaged over all trees, and this average is used as the feature selection criterion.
    from sklearn.datasets import load_boston
    from sklearn.ensemble import RandomForestRegressor
    import numpy as np

    # Load the Boston housing dataset as an example
    boston = load_boston()
    X = boston["data"]
    Y = boston["target"]
    names = boston["feature_names"]
    # Train a random forest and read each feature's importance
    # score from the feature_importances_ attribute.
    rf = RandomForestRegressor()
    rf.fit(X, Y)
    print("Features sorted by their score:")
    print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names), reverse=True))
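What "impurity reduction" means for a single split can be shown with a few numbers. This is a minimal sketch that assumes variance as the impurity measure, a common choice for regression trees; the function names and target values are invented for illustration.

```python
def variance(values):
    """Population variance, used here as the node impurity."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - weighted_children


targets = [1, 1, 5, 5]
good_split = impurity_decrease(targets, [1, 1], [5, 5])  # separates the two groups
bad_split = impurity_decrease(targets, [1, 5], [1, 5])   # leaves both children mixed
print(good_split, bad_split)
```

A feature that produces splits like the first one accumulates a large impurity decrease across the tree, while a feature that only produces splits like the second contributes nothing.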
2. Mean decrease accuracy
This method selects features by directly measuring the impact of each feature on model accuracy.
The main idea is to shuffle the values of each feature in turn and measure how much the shuffling changes the accuracy of the model.
For unimportant variables, shuffling has little effect on accuracy.
For important variables, shuffling reduces accuracy.
    # Reuses boston, names, np, and RandomForestRegressor from the previous example
    from sklearn.model_selection import ShuffleSplit
    from sklearn.metrics import r2_score
    from collections import defaultdict

    X = boston["data"]
    Y = boston["target"]
    rf = RandomForestRegressor()
    scores = defaultdict(list)
    # Cross-validate the scores on a number of different random splits of the data
    for train_idx, test_idx in ShuffleSplit(n_splits=100, test_size=0.3).split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        Y_train, Y_test = Y[train_idx], Y[test_idx]
        # Train on the unmodified features; this accuracy is the
        # baseline that the shuffled runs are compared against.
        r = rf.fit(X_train, Y_train)
        acc = r2_score(Y_test, rf.predict(X_test))
        # Go through each feature column
        for i in range(X.shape[1]):
            X_t = X_test.copy()
            # Shuffle this column, permuting the feature's values among the samples
            np.random.shuffle(X_t[:, i])
            shuff_acc = r2_score(Y_test, rf.predict(X_t))
            # Record the relative drop in accuracy after shuffling this feature
            scores[names[i]].append((acc - shuff_acc) / acc)
    print("Features sorted by their score:")
    print(sorted([(round(np.mean(score), 4), feat) for feat, score in scores.items()], reverse=True))
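The shuffling idea does not depend on random forests. The sketch below applies the same relative-drop formula, (acc - shuff_acc) / acc, to a hypothetical hand-written model in plain Python; the model, the data, and the helper names are assumptions made for illustration only.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def model(row):
    """Hypothetical fitted model: uses feature 0 and ignores feature 1."""
    return 3 * row[0]


def permutation_importance(rows, y, n_features, rng):
    """Relative drop in R^2 after shuffling each feature column in turn."""
    baseline = r2_score(y, [model(r) for r in rows])
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled values
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        score = r2_score(y, [model(r) for r in shuffled])
        importances.append((baseline - score) / baseline)
    return importances


rng = random.Random(0)
rows = [[float(i), rng.random()] for i in range(20)]
y = [3 * r[0] for r in rows]
importances = permutation_importance(rows, y, 2, rng)
print(importances)
```

Because the model never reads feature 1, shuffling it leaves every prediction unchanged and its importance is exactly zero, whereas shuffling feature 0 breaks the predictions and yields a positive score.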
Feature selection method 5: top-level feature selection
1. Stability selection
The main idea is to run a feature selection algorithm on different subsets of the data and different subsets of the features, repeat this many times, and aggregate the selection results. For each feature, one counts how often it was selected as important (the number of times it was selected divided by the number of subsets in which it was tested).
Ideally, important features score close to 100%. Somewhat weaker features score above 0, and the least useful features score close to 0.
    # RandomizedLasso was deprecated in scikit-learn 0.19 and removed in 0.21,
    # and load_boston was removed in 1.2; this code needs an older release.
    from sklearn.linear_model import RandomizedLasso
    from sklearn.datasets import load_boston

    boston = load_boston()
    # Using the Boston housing data.
    # Data gets scaled automatically by sklearn's implementation
    X = boston["data"]
    Y = boston["target"]
    names = boston["feature_names"]
    rlasso = RandomizedLasso(alpha=0.025)
    rlasso.fit(X, Y)
    print("Features sorted by their score:")
    print(sorted(zip(map(lambda x: round(x, 4), rlasso.scores_), names), reverse=True))
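The resample-select-count loop behind stability selection can be sketched without any library. The version below is a simplification under stated assumptions: it subsamples rows only (not features), and it swaps in a plain correlation threshold as the base selector in place of Lasso. All names, the threshold, and the toy data are hypothetical.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denom = sqrt(sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys))
    return cov / denom if denom else 0.0


def select(rows, y, threshold=0.5):
    """Stand-in base selector: keep features with |correlation| above the threshold."""
    return {j for j in range(len(rows[0]))
            if abs(pearson([r[j] for r in rows], y)) > threshold}


def stability_scores(rows, y, n_rounds=50, sample_size=30, seed=0):
    """Share of subsampling rounds in which each feature was selected."""
    rng = random.Random(seed)
    counts = [0] * len(rows[0])
    for _ in range(n_rounds):
        idx = rng.sample(range(len(rows)), sample_size)
        for j in select([rows[i] for i in idx], [y[i] for i in idx]):
            counts[j] += 1
    return [c / n_rounds for c in counts]


# Feature 0 equals the target; feature 1 only alternates between +1 and -1.
rows = [[float(i), 1.0 if i % 2 == 0 else -1.0] for i in range(60)]
y = [float(i) for i in range(60)]
scores = stability_scores(rows, y)
print(scores)
```

The informative feature is selected in every round, so its frequency is 100%, while the alternating feature is picked rarely if at all, which mirrors the ideal score pattern described above.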
2. Recursive feature elimination (RFE)
The main idea of recursive feature elimination is to build a model repeatedly (for example an SVM or a regression model), pick the best (or worst) feature (for example by its coefficient), set that feature aside, and repeat the process on the remaining features until all features have been processed.
The order in which features are eliminated is their ranking. RFE is therefore a greedy algorithm for finding the optimal feature subset.
The stability of RFE depends largely on the underlying model used in each iteration.
If RFE uses ordinary regression, which is unstable without regularization, then RFE is unstable.
If RFE uses Ridge, which is stabilized by regularization, then RFE is stable.
    from sklearn.datasets import load_boston
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LinearRegression

    boston = load_boston()
    X = boston["data"]
    Y = boston["target"]
    names = boston["feature_names"]
    # Use linear regression as the model
    lr = LinearRegression()
    # Rank all features, i.e. continue the elimination until the last one
    rfe = RFE(lr, n_features_to_select=1)
    rfe.fit(X, Y)
    print("Features sorted by their rank:")
    print(sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names)))
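The elimination loop itself is short once a model-fitting step is available. The sketch below is a plain-Python illustration, assuming ordinary least squares without an intercept as the underlying model and dropping the feature with the smallest absolute coefficient in each round; the helper names and the toy data are hypothetical, and this is not scikit-learn's implementation.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols_coefficients(rows, y, features):
    """Least-squares coefficients (no intercept) using only the given features."""
    cols = [[row[f] for row in rows] for f in features]
    gram = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    return solve_linear_system(gram, rhs)


def rfe_ranking(rows, y):
    """Refit, drop the weakest feature, repeat; rank 1 is the last survivor."""
    remaining = list(range(len(rows[0])))
    eliminated = []
    while len(remaining) > 1:
        coefs = ols_coefficients(rows, y, remaining)
        weakest = min(range(len(remaining)), key=lambda i: abs(coefs[i]))
        eliminated.append(remaining.pop(weakest))
    order = remaining + eliminated[::-1]
    return {feature: rank for rank, feature in enumerate(order, start=1)}


rows = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1], [2, 1, 0]]
y = [5 * r[0] + 1 * r[1] for r in rows]  # feature 2 plays no role
ranking = rfe_ranking(rows, y)
print(ranking)
```

The model is refit after every elimination, which is what distinguishes RFE from ranking the coefficients of a single fit; here feature 2 is removed first, then feature 1, leaving feature 0 with rank 1.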
Regularized linear models can be used for both feature selection and feature understanding. Compared with L1 regularization, L2 regularization is more stable, which makes it better suited for understanding the data.
Because the relationship between the response variable and the features is often nonlinear, basis expansion can be used to transform the features into a more suitable space, after which simple linear models can be considered.
Random forest is a very popular feature selection method and is easy to use, but it has two main problems:
- Important features may receive low scores (the correlated-features problem)
- The method favors features with more categories (the bias problem)
When selecting features to improve model performance, cross-validation can be used to check whether one method works better than another.
When using feature selection to understand data, pay attention to the stability of the feature selection model; an unstable model can easily lead to wrong conclusions.