Feature selection:
A means of reducing feature dimensionality: discard features that are unrelated to the result, and discard features that are only weakly related to it; in this way the dimension is reduced.

When a dataset has too many features, some of which have nothing to do with the result, deleting those irrelevant features can yield better prediction results.
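To make the idea concrete before the full walkthrough, here is a minimal sketch of chi-squared feature selection with scikit-learn's SelectPercentile. The data here is synthetic and invented purely for illustration: only the column that actually tracks the label survives the filter.

import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2

# Synthetic data: 6 samples, 4 non-negative features.
# Only the first column tracks the label; the rest are noise.
X = np.array([[5, 1, 2, 1],
              [6, 0, 1, 2],
              [7, 1, 3, 1],
              [0, 2, 2, 2],
              [1, 0, 1, 1],
              [0, 1, 3, 2]])
y = np.array([1, 1, 1, 0, 0, 0])

# Keep the top 25% of features ranked by the chi-squared statistic.
selector = SelectPercentile(chi2, percentile=25)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (6, 1): only the informative column remains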
This post uses Python 3 to learn the scikit-learn API.
The dataset comes from the internet; I have downloaded it locally, and it can also be downloaded from my git:
git: https://github.com/linyi0604/MachineLearning
Code:
import pandas as pd
# sklearn >= 0.20: the old sklearn.cross_validation module was renamed to model_selection
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn import feature_selection
from sklearn.model_selection import cross_val_score
import numpy as np
import pylab as pl

'''
Feature selection:
A means of reducing feature dimensionality:
discard features that are unrelated to the result,
discard features that are only weakly related to the result,
and in this way reduce the dimension.

There are too many features in the dataset, some of which have nothing to do with the result;
deleting the unrelated features can yield better prediction results.

Below, a decision tree is used to predict survival on the Titanic.
Different percentages of features are selected for learning and prediction, and the accuracies are compared.
'''

# 1 Prepare the data
titanic = pd.read_csv("../data/titanic/titanic.txt")
# Separate the features from the target
y = titanic["survived"]
x = titanic.drop(["row.names", "name", "survived"], axis=1)
# Fill in missing values
x['age'].fillna(x['age'].mean(), inplace=True)
x.fillna("UNKNOWN", inplace=True)

# 2 Split the dataset: 25% for testing, 75% for training
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=33)

# 3 Vectorize the categorical features ('records', not 'record')
vec = DictVectorizer()
x_train = vec.fit_transform(x_train.to_dict(orient='records'))
x_test = vec.transform(x_test.to_dict(orient='records'))
# Print the dimension of the vectorized data
# print(len(vec.feature_names_))  # 474

# 4 Learn and predict with a decision tree on all features
dt = DecisionTreeClassifier(criterion='entropy')
dt.fit(x_train, y_train)
print("Prediction accuracy with all dimensions:", dt.score(x_test, y_test))  # 0.8206686930091185

# 5 Keep the top 20% of features and evaluate a decision tree with the same configuration
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=20)
x_train_fs = fs.fit_transform(x_train, y_train)
x_test_fs = fs.transform(x_test)
dt.fit(x_train_fs, y_train)
print("Prediction accuracy with the top 20% of features:", dt.score(x_test_fs, y_test))  # 0.8237082066869301

# 6 Use cross-validation to evaluate feature selection at fixed percentile intervals
percentiles = range(1, 100, 2)
results = []
for i in percentiles:
    fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=i)
    x_train_fs = fs.fit_transform(x_train, y_train)
    scores = cross_val_score(dt, x_train_fs, y_train, cv=5)
    results = np.append(results, scores.mean())
# print(results)
'''
[0.85063904 0.85673057 0.87501546 0.88622964 0.86284271 0.86489384
 0.87303649 0.86689342 0.87098536 0.86690373 0.86895485 0.86083282
 0.86691404 0.86488353 0.86895485 0.86792414 0.86284271 0.86995465
 0.86486291 0.86385281 0.86384251 0.86894455 0.86794475 0.86690373
 0.86488353 0.86489384 0.86590394 0.87300557 0.86995465 0.86793445
 0.87097506 0.86998557 0.86692435 0.86892393 0.86997526 0.87098536
 0.87198516 0.86691404 0.86691404 0.87301587 0.87202639 0.8648423
 0.86386312 0.86388374 0.86794475 0.8618223  0.85877139 0.86285302
 0.86692435 0.8577819 ]
'''
# Find the percentile with the best performance
opt = np.where(results == results.max())[0][0]
print("The percentile of the best-performing selection is: %s%%" % percentiles[opt])  # 7

pl.plot(percentiles, results)
pl.xlabel("Percentage of features selected")
pl.ylabel("Accuracy")
pl.show()
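As a side note, if you want to see which of the 474 vectorized columns a selector actually kept, SelectPercentile exposes get_support(). A small sketch, assuming the fs and vec objects from the script above are still in scope (after the loop, fs holds the last fitted selector, so it is refit here at the optimal percentile first):

# Refit the selector at the best percentile found above (7%),
# then map the surviving column indices back to feature names.
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=7)
fs.fit(x_train, y_train)
kept = fs.get_support(indices=True)
print([vec.feature_names_[i] for i in kept[:10]])  # first 10 kept feature names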
Generated accuracy graph (mean cross-validated accuracy versus feature-selection percentile; the curve peaks at 7%, matching the printed optimum):