The Road to Machine Learning: Feature Dimensionality Reduction and Feature Screening in Python with feature_selection

 

Feature selection (feature screening):
A means of feature dimensionality reduction. Discard features that are unrelated to the result, and discard features that are only weakly related to the result, thereby reducing the dimensionality. When a dataset contains too many features, some of which have nothing to do with the result, deleting those irrelevant features can actually yield better prediction results.
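
For intuition, here is a minimal, self-contained sketch of percentile-based feature screening on a synthetic dataset (the data and the percentile value are illustrative assumptions, not part of the Titanic example below):

# Minimal sketch: chi2-based feature screening on synthetic, non-negative data
# (the dataset and percentile here are illustrative assumptions).
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2

rng = np.random.RandomState(0)
X = rng.randint(0, 10, size=(100, 20))        # 20 non-negative features
y = (X[:, 0] + X[:, 1] > 9).astype(int)       # only the first two columns matter

selector = SelectPercentile(chi2, percentile=10)  # keep the top 10% of features
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)   # (100, 2): only the highest-scoring 10% of columns remain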

Learning to use the API with Python 3.

The dataset comes from the web; I have downloaded it locally, and you can also download it from my git repository:

git: https://github.com/linyi0604/MachineLearning

Code:

import pandas as pd
import numpy as np
import pylab as pl
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn import feature_selection

'''
Feature selection:
    A means of feature dimensionality reduction.
    Discard features that have no connection to the result,
    discard features that have only a weak connection to the result,
    and thereby reduce the dimensionality.

    There are too many features in the dataset, some of which have nothing
    to do with the result; deleting the unrelated features at this point
    can yield better prediction results.

    Below, a decision tree is used to predict survival on the Titanic.
    Different percentages of the features are selected for learning and
    prediction, and the accuracies are compared.
'''

# 1 Prepare the data
titanic = pd.read_csv("../data/titanic/titanic.txt")
# Separate the data features and the target
y = titanic["survived"]
x = titanic.drop(["row.names", "name", "survived"], axis=1)
# Fill missing values (mean for age, "UNKNOWN" for the rest)
x['age'].fillna(x['age'].mean(), inplace=True)
x.fillna("UNKNOWN", inplace=True)

# 2 Split the dataset: 25% for testing, 75% for training
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=33)

# 3 Vectorize the categorical features
vec = DictVectorizer()
x_train = vec.fit_transform(x_train.to_dict(orient='records'))
x_test = vec.transform(x_test.to_dict(orient='records'))
# Output the dimensionality of the processed vectors
# print(len(vec.feature_names_))   # 474

# 4 Learn and predict with a decision tree on all features
dt = DecisionTreeClassifier(criterion='entropy')
dt.fit(x_train, y_train)
print("Prediction accuracy with all dimensions:", dt.score(x_test, y_test))   # 0.8206686930091185

# 5 Filter the top 20% of features and evaluate a decision tree with the same configuration
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=20)
x_train_fs = fs.fit_transform(x_train, y_train)
x_test_fs = fs.transform(x_test)
dt.fit(x_train_fs, y_train)
print("Prediction accuracy with the top 20% of features:", dt.score(x_test_fs, y_test))   # 0.8237082066869301

# 6 Screen features at fixed percentage intervals and measure performance via cross-validation
percentiles = range(1, 100, 2)
results = []
for i in percentiles:
    fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=i)
    x_train_fs = fs.fit_transform(x_train, y_train)
    scores = cross_val_score(dt, x_train_fs, y_train, cv=5)
    results = np.append(results, scores.mean())
# print(results)
'''
[0.85063904 0.85673057 0.87501546 0.88622964 0.86284271 0.86489384
 0.87303649 0.86689342 0.87098536 0.86690373 0.86895485 0.86083282
 0.86691404 0.86488353 0.86895485 0.86792414 0.86284271 0.86995465
 0.86486291 0.86385281 0.86384251 0.86894455 0.86794475 0.86690373
 0.86488353 0.86489384 0.86590394 0.87300557 0.86995465 0.86793445
 0.87097506 0.86998557 0.86692435 0.86892393 0.86997526 0.87098536
 0.87198516 0.86691404 0.86691404 0.87301587 0.87202639 0.8648423
 0.86386312 0.86388374 0.86794475 0.8618223  0.85877139 0.86285302
 0.86692435 0.8577819 ]
'''
# Find the screening percentage with the best performance
opt = np.where(results == results.max())[0][0]
print("The percentile of the highest performing filter is: %s%%" % percentiles[opt])   # 7

pl.plot(percentiles, results)
pl.xlabel("Percentage of Feature Screening")
pl.ylabel("Accuracy")
pl.show()

Generated accuracy graph (cross-validated accuracy vs. percentage of features selected):
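
As a follow-up, the best percentile found above could be applied back to the held-out test split to check how the screened model generalizes. This step is not part of the original listing, so treat it as a hedged sketch; it reuses the variables defined in the code above and does not claim a specific accuracy number:

# Hedged follow-up sketch: refit with the best percentile found above and
# score on the held-out test split (reuses x_train, x_test, y_train, y_test,
# dt, percentiles and opt from the listing above).
fs = feature_selection.SelectPercentile(feature_selection.chi2,
                                        percentile=percentiles[opt])
x_train_fs = fs.fit_transform(x_train, y_train)
x_test_fs = fs.transform(x_test)
dt.fit(x_train_fs, y_train)
print("Test accuracy with the best %s%% of features:" % percentiles[opt],
      dt.score(x_test_fs, y_test))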

 
