1. Why reduce data dimensionality?
In real machine learning projects, feature selection / dimensionality reduction is needed because the data often has the following problems:
- Multicollinearity: the feature attributes are correlated with one another. Multicollinearity makes the solution space unstable, which in turn weakens the model's ability to generalize;
- In high-dimensional space the samples become sparse, which makes it harder for the model to find patterns in the data;
- Too many variables get in the way of the model finding regularities;
- Looking only at a single variable's effect on the target may ignore the potential influence of relationships between variables.
The goals of feature selection / dimensionality reduction are:
- Reduce the number of feature attributes
- Make sure the feature attributes are independent of one another
Of course, sometimes the feature matrix is simply too large, which makes computation expensive and training time long.
The benefits of dimensionality reduction:
- Fewer dimensions and less storage space needed for the data
- Less computation time when training a model
- Redundant variables are removed, improving the accuracy of the algorithm
- Easier data visualization
2. SelectFromModel
Here we take iris as an example. The iris dataset has four features, but not every feature is important, so we filter them to see which features matter more.
# imports
# package needed for feature selection
from sklearn.feature_selection import SelectFromModel
# some models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn import datasets

import numpy as np

import warnings
warnings.filterwarnings('ignore')
X, y = datasets.load_iris(return_X_y=True)

estimator = LogisticRegression()
# estimator: the model; threshold: importance threshold; max_features: how many features to keep
sfm = SelectFromModel(estimator=estimator, threshold=-np.inf, max_features=2)
X2 = sfm.fit_transform(X, y)
X2[:10]
The filtered result:
For comparison, the original data: X[:10]
As can be seen, the third and fourth features are the ones singled out, which shows that their importance is high.
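To double-check which columns were kept, the selector's support mask can be inspected. A minimal sketch, assuming the sfm selector fitted above (the printed values are what iris typically gives, not guaranteed output):

import numpy as np

mask = sfm.get_support()       # boolean mask over the four iris features
print(mask)                    # e.g. [False False  True  True]
print(np.where(mask)[0])       # column indices of the retained features, e.g. [2 3]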
# the corresponding coefficient weights
np.abs(sfm.estimator_.coef_).mean(axis=0)
From the output it can be seen that the weights of features 3 and 4 are clearly larger.
# how important each feature is
# (feature_importances_ only exists on tree-style estimators, not on LogisticRegression)
sfm.estimator_.feature_importances_
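Since the logistic-regression selector above only exposes coef_, here is a minimal sketch of the same selection done with a tree model, which does provide feature_importances_ (the variable name tree_sfm is just for illustration):

import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)

# keep the two features the tree considers most important
tree_sfm = SelectFromModel(DecisionTreeClassifier(), threshold=-np.inf, max_features=2)
tree_sfm.fit(X, y)
tree_sfm.estimator_.feature_importances_   # one importance value per original feature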
# variance-based filtering
from sklearn.feature_selection import VarianceThreshold

vt = VarianceThreshold(threshold=0.68)
X3 = vt.fit_transform(X, y)
X.var(axis=0)
Filtering by variance instead, the features picked out are features 1 and 3.
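To see why those two survive, the per-feature variances can be compared against the threshold. A small sketch, assuming the vt selector fitted above (the numbers shown are the approximate iris variances):

print(vt.variances_)     # roughly [0.68, 0.19, 3.10, 0.58]
print(vt.get_support())  # e.g. [ True False  True False] -> features 1 and 3 pass the 0.68 threshold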
3. SelectKBest
This is actually the chi-square test from statistics.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

iris = load_iris()
X, y = iris.data, iris.target

# k=2: keep the two attributes with the highest scores
# chi2 is the chi-square test from statistics; it measures the dependence between random variables
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape

X[:10]
X_new[:10]
According to the chi-square scores, the attributes picked out are again features 3 and 4.
Let's verify this:
# returns the chi-square statistic and the p-value for each feature
chi2(X, y)
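The same scores can also be read off a fitted selector; a minimal sketch (skb is just an illustrative name):

skb = SelectKBest(chi2, k=2).fit(X, y)
print(skb.scores_)    # chi-square statistic per feature; largest for features 3 and 4
print(skb.pvalues_)   # the corresponding p-values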
4. RFE (Recursive Feature Elimination)
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

estimator = DecisionTreeClassifier()
rfe = RFE(estimator)
X2 = rfe.fit_transform(X, y)
X2[:10]

# the copy of the estimator that RFE refit on the selected features
model = rfe.estimator_
model.feature_importances_

# for comparison, fit the original estimator on all features
estimator.fit(X, y)
estimator.feature_importances_
model.feature_importances_ gives the importances of the two features that RFE kept, while estimator.feature_importances_ gives the importances of all four original features.
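RFE also records which features survived and in what order the rest were dropped; a small sketch, assuming the rfe selector fitted above:

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger numbers were eliminated earlier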
5. A note on the chi-square test and the chi-square distribution
Chi-square test:
The chi-square test is a very widely used hypothesis-testing method. Its applications in statistical inference for categorical data include: the chi-square test for comparing two rates or two proportions, the chi-square test for comparing multiple rates or multiple proportions, and correlation analysis of categorical data.
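As a concrete illustration of comparing two proportions, here is a minimal sketch using scipy; the contingency-table counts are made up for the example:

from scipy.stats import chi2_contingency

# hypothetical counts: rows = group A / group B, columns = success / failure
table = [[30, 70],
         [45, 55]]
stat, p, dof, expected = chi2_contingency(table)
print(stat, p, dof)   # chi-square statistic, p-value, degrees of freedom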
Calculation formula:
To be updated after the blogger has studied it further .......
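For reference, the standard Pearson chi-square statistic is

χ² = Σ (O − E)² / E

where O is the observed frequency and E is the expected frequency under the null hypothesis; the statistic follows a chi-square distribution with the corresponding degrees of freedom.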