Data dimensionality reduction and feature selection

1. Why data dimensionality reduction?

In real machine learning projects, feature selection / dimensionality reduction is necessary because the data often has the following problems:

  • Multicollinearity: the feature attributes are correlated with one another. Multicollinearity makes the solution space unstable, which weakens the model's ability to generalize;
  • In high-dimensional space the samples are sparse, making it harder for the model to find patterns in the data;
  • Too many variables get in the way of the model discovering the underlying patterns;
  • Considering only a single variable's effect on the target may miss the potential influence of relationships between variables.

The goals of feature selection / dimensionality reduction are:

  • Reduce the number of feature attributes
  • Keep the remaining feature attributes independent of one another

Of course, sometimes the feature matrix is simply too large, which makes computation expensive and training times long.

The benefits of dimensionality reduction:

  • Fewer dimensions, so the data needs less storage space
  • Training the model takes less computing time
  • Removing redundant variables improves the algorithm's accuracy
  • Makes data visualization easier

2.SelectFromModel

Here we use iris as an example. Iris has four features, but not every feature is important, so we sift through them to see which ones matter most.

# imports
# feature selection
from sklearn.feature_selection import SelectFromModel
# some models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn import datasets

import numpy as np

import warnings
warnings.filterwarnings('ignore')

X, y = datasets.load_iris(return_X_y=True)

estimator = LogisticRegression()
# max_features: how many features to keep; threshold: cut-off applied to the estimator's scores
sfm = SelectFromModel(estimator=estimator, threshold=-np.inf, max_features=2)
X2 = sfm.fit_transform(X, y)
X2[:10]

The filtered result:

The original data for comparison: X[:10]

As we can see, the third and fourth features were singled out, which shows that their importance is high.
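
To double-check which columns were kept, we can also look at the selector's support mask (a small illustrative check using SelectFromModel's get_support):

# indices of the columns SelectFromModel kept (here: 2 and 3, i.e. the 3rd and 4th features)
sfm.get_support(indices=True)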

# the corresponding coefficient weights
np.abs(sfm.estimator_.coef_).mean(axis=0)

From the output we can see that features 3 and 4 have significantly larger weights.

# feature importances (only available when the estimator is tree-based, e.g. DecisionTreeClassifier;
# LogisticRegression does not expose this attribute)
sfm.estimator_.feature_importances_

 

# variance-based filtering
from sklearn.feature_selection import VarianceThreshold

vt = VarianceThreshold(threshold=0.68)
X3 = vt.fit_transform(X, y)
X.var(axis=0)

 

Filtering by variance selects features 1 and 3.
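
A quick way to see this directly is to ask the selector which columns it kept (an illustrative check, to compare against the variances printed above):

# indices of the columns whose variance is above the 0.68 threshold (here: 0 and 2, i.e. features 1 and 3)
vt.get_support(indices=True)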

3.SelectKBest

This is actually the chi-square test from statistics.

 

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

iris = load_iris()
X, y = iris.data, iris.target

# k=2: keep the two attributes with the highest scores
# chi2 is the chi-square test from statistics; it measures the dependence between random variables
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape

X[:10]
X_new[:10]

 


 

According to the chi-square scores, the selected attributes are again 3 and 4.

Let's verify:

chi2(X, y)
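
chi2 returns two arrays, the chi-square score and the p-value of each feature, so we can read off the highest-scoring columns directly (a small illustrative check):

scores, pvalues = chi2(X, y)
# indices of the two highest-scoring features (here: 2 and 3, i.e. petal length and petal width)
np.argsort(scores)[-2:]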

4.RFE (Recursive Feature Elimination)

from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

estimator = DecisionTreeClassifier()
# by default RFE keeps half of the features
rfe = RFE(estimator)
X2 = rfe.fit_transform(X, y)
X2[:10]

# importances from the model refit on the selected features
model = rfe.estimator_
model.feature_importances_

# for comparison: importances when fitting on all four features
estimator.fit(X, y)
estimator.feature_importances_
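
RFE also records which features survived and how every feature ranked during the elimination; a quick illustrative look at its support_ and ranking_ attributes:

# boolean mask of the features RFE kept
rfe.support_
# rank of each feature: 1 for the kept features, larger values were eliminated earlier
rfe.ranking_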


5. Some notes on the chi-square test and the chi-square distribution

The chi-square test:

The chi-square test is a very widely used hypothesis-testing method. Its applications to statistical inference on categorical data include: chi-square tests comparing two rates or two proportions, chi-square tests comparing multiple rates or multiple proportions, and correlation analysis of categorical data.

The statistic is computed as:

χ² = Σ (A − E)² / E, where A is the observed frequency and E is the expected frequency.

This part will be updated once the blogger has had time to study it further...
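
In the meantime, here is a minimal sketch of how the statistic above can be reproduced for the iris features, assuming scikit-learn's convention for chi2 that the observed frequency of a feature within a class is the sum of that feature's values over that class's samples:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelBinarizer

X, y = load_iris(return_X_y=True)

# observed frequencies A: per-class sums of each feature (n_classes x n_features)
Y = LabelBinarizer().fit_transform(y)
observed = Y.T @ X

# expected frequencies E: class proportion times the total sum of each feature
expected = np.outer(Y.mean(axis=0), X.sum(axis=0))

# chi-square statistic per feature: sum over classes of (A - E)^2 / E
chi2_manual = ((observed - expected) ** 2 / expected).sum(axis=0)

# this should match the scores returned by chi2(X, y)
chi2_manual, chi2(X, y)[0]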
