Python3: learning the APIs for random forest and gradient boosting decision tree classification, and comparing their predictions with a single decision tree
My GitHub is attached; you are welcome to refer to the code for my other classifiers: https://github.com/linyi0604/MachineLearning
import pandas as pd
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

'''
Ensemble classification:
Combines the predictions of multiple classifiers to reach a final decision.
Ensembles generally come in two types:
1. Build multiple independent classification models and combine them by voting,
   e.g. the random forest classifier. A random forest builds many decision trees
   on the training data at the same time; during construction each tree abandons
   the usual deterministic split-selection algorithm and instead chooses splits
   from a random subset of the features.
2. Build multiple classification models in sequence, so that each depends on the
   ones before it. Each newly added model must improve the combined performance
   of the existing ensemble, building a relatively strong classifier from many
   weaker ones, e.g. the gradient boosting decision tree. Gradient boosting adds
   each tree so as to minimize the error the current ensemble makes in fitting
   the data.
'''
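The two ensemble styles described above can be sketched side by side. This is an illustrative example, not part of the original post: it uses a synthetic dataset and scikit-learn's `VotingClassifier` (independent models combined by majority vote, as in a random forest) next to `GradientBoostingClassifier` (models added sequentially).

```python
# Sketch of the two ensemble styles on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=33)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=33)

# Style 1: independent models whose votes are combined
voter = VotingClassifier(estimators=[
    ('tree', DecisionTreeClassifier(random_state=33)),
    ('lr', LogisticRegression(max_iter=1000)),
], voting='hard')
voter.fit(X_tr, y_tr)

# Style 2: models added in sequence, each correcting the current ensemble
booster = GradientBoostingClassifier(random_state=33)
booster.fit(X_tr, y_tr)

print('voting accuracy:   %.3f' % voter.score(X_te, y_te))
print('boosting accuracy: %.3f' % booster.score(X_te, y_te))
```

On real data the comparison depends heavily on the dataset and hyperparameters; the point here is only the structural difference between the two styles.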
'''
Below, the predictions of a single decision tree, a random forest and a
gradient boosting decision tree are compared.
'''

'''
1 Prepare the data
'''
# Read the Titanic passenger data (previously downloaded to a local file)
titanic = pd.read_csv("./data/titanic/titanic.txt")
# Inspecting the data shows that some values are missing
# print(titanic.head())

# Select key features: pclass, age and sex are likely to affect survival
x = titanic[['pclass', 'age', 'sex']]
y = titanic['survived']
# View the currently selected features
# print(x.info())
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass    1313 non-null object
age        633 non-null float64
sex       1313 non-null object
dtypes: float64(1), object(2)
memory usage: 30.9+ KB
None
'''
# Only 633 of the age values are present. Fill the gaps with the mean
# (the median would also work) so the model can use the column.
x = x.copy()  # avoid pandas' SettingWithCopyWarning
x['age'].fillna(x['age'].mean(), inplace=True)

'''
2 Split the data
'''
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=33)
# Use a feature converter for feature extraction
vec = DictVectorizer()
# Categorical features are one-hot encoded; numerical features are left unchanged
x_train = vec.fit_transform(x_train.to_dict(orient="records"))
# print(vec.feature_names_)  # ['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male']
x_test = vec.transform(x_test.to_dict(orient="records"))

'''
3.1 Train and predict with a single decision tree
'''
# Initialize the decision tree classifier
dtc = DecisionTreeClassifier()
# Train
dtc.fit(x_train, y_train)
# Predict and save the result
dtc_y_predict = dtc.predict(x_test)

'''
3.2 Train and predict with a random forest
'''
# Initialize the random forest classifier
rfc = RandomForestClassifier()
# Train
rfc.fit(x_train, y_train)
# Predict
rfc_y_predict = rfc.predict(x_test)

'''
3.3 Train and predict with a gradient boosting decision tree
'''
# Initialize the classifier
gbc = GradientBoostingClassifier()
# Train
gbc.fit(x_train, y_train)
# Predict
gbc_y_predict = gbc.predict(x_test)


'''
4 Model evaluation
'''
print("Single decision tree accuracy:", dtc.score(x_test, y_test))
print("Other metrics:\n", classification_report(dtc_y_predict, y_test, target_names=['died', 'survived']))

print("Random forest accuracy:", rfc.score(x_test, y_test))
print("Other metrics:\n", classification_report(rfc_y_predict, y_test, target_names=['died', 'survived']))

print("Gradient boosted decision tree accuracy:", gbc.score(x_test, y_test))
print("Other metrics:\n", classification_report(gbc_y_predict, y_test, target_names=['died', 'survived']))

'''
Single decision tree accuracy: 0.7811550151975684
Other metrics:
              precision    recall  f1-score   support

       died       0.91      0.78      0.84       236
   survived       0.58      0.80      0.67        93

avg / total       0.81      0.78      0.79       329

Random forest accuracy: 0.78419452887538
Other metrics:
              precision    recall  f1-score   support

       died       0.91      0.78      0.84       237
   survived       0.58      0.80      0.68        92

avg / total       0.82      0.78      0.79       329

Gradient boosted decision tree accuracy: ...
Other metrics:
              precision    recall  f1-score   support

       died       0.92      0.78      0.84       239
   survived       0.58      0.82      0.68        90

avg / total       0.83      0.79      0.80       329
'''
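The three accuracies above come from a single 25% hold-out split, so the differences between the models are within noise. A more robust comparison would use cross-validation; the sketch below (not from the original post, and using synthetic stand-in data since the Titanic file is local) shows how the same three models could be compared with 5-fold `cross_val_score`.

```python
# Compare the three models with 5-fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=33)

results = {}
for name, model in [
    ('single tree', DecisionTreeClassifier(random_state=33)),
    ('random forest', RandomForestClassifier(random_state=33)),
    ('gradient boosting', GradientBoostingClassifier(random_state=33)),
]:
    scores = cross_val_score(model, X, y, cv=5)  # accuracy on each of 5 folds
    results[name] = scores
    print('%-18s mean accuracy: %.3f (+/- %.3f)' % (name, scores.mean(), scores.std()))
```

Reporting a mean and spread over folds makes it easier to tell whether the ensembles genuinely outperform the single tree on a given dataset.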