git: https://github.com/linyi0604/MachineLearning
The dataset has been downloaded locally; you can get it from my git repository above.
XGBoost is a boosting classifier and belongs to the family of ensemble learning models. It combines hundreds of tree models, each with low individual classification accuracy, by iterating: every iteration fits a new tree that corrects the errors of the current ensemble.
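The additive idea behind boosting can be sketched with plain scikit-learn tools. This is a toy illustration, not XGBoost itself: each round fits a small regression tree to the residuals of the running prediction and adds it in, scaled by a learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy additive-boosting sketch: each round fits a shallow tree to the
# current residuals and adds its (scaled) prediction to the ensemble.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

prediction = np.zeros_like(y)
learning_rate = 0.1
for _ in range(100):
    residual = y - prediction                      # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)                          # new tree targets the residuals
    prediction += learning_rate * tree.predict(X)  # add it to the ensemble

mse = np.mean((y - prediction) ** 2)
print(f"training MSE after 100 rounds: {mse:.4f}")
```

XGBoost refines this scheme with regularization, second-order gradients, and efficient tree construction, but the round-by-round accumulation of small trees is the same.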
Below, the XGBoost model and a random forest classifier are used to predict survival in the Titanic disaster, and their performance is compared.
import pandas as pd
from sklearn.model_selection import train_test_split  # cross_validation was removed in sklearn 0.20
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

titanic = pd.read_csv("../data/titanic/titanic.txt")

# Use pclass, age and sex as the training features
x = titanic[["pclass", "age", "sex"]].copy()  # copy() avoids SettingWithCopyWarning below
y = titanic["survived"]

# Fill missing ages with the mean age
x["age"].fillna(x["age"].mean(), inplace=True)

# Split into training and test data
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=33)

# Extract dictionary features and vectorize them
vec = DictVectorizer()
x_train = vec.fit_transform(x_train.to_dict(orient="records"))  # "records", not "record"
x_test = vec.transform(x_test.to_dict(orient="records"))

# Predict with a default-configuration random forest
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
print("Random forest prediction accuracy:", rfc.score(x_test, y_test))  # 0.7811550151975684

# Predict with the XGBoost model
xgbc = XGBClassifier()
xgbc.fit(x_train, y_train)
print("XGBoost prediction accuracy:", xgbc.score(x_test, y_test))  # 0.7872340425531915