I have been learning the decision tree classifier API in scikit-learn, using Python 3.
The example involves feature extraction: numeric features keep their type, while categorical features are expanded into new features, one per category value.
The dataset has to be downloaded from the internet; I saved a copy locally.
You can download the code and the dataset from my GitHub repository: https://github.com/linyi0604/MachineLearning
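To make the feature-extraction step concrete, here is a small dependency-free sketch of the idea: numeric values pass through unchanged, while each string-valued feature becomes a set of name=value indicator columns. The function one_hot_records and the sample rows are hypothetical and written only for illustration. This is not how scikit-learn's DictVectorizer is implemented, although it follows the same name=value column naming.

```python
def one_hot_records(records):
    """Turn a list of feature dicts into (column_names, rows).

    Numeric values keep their value; string values become
    indicator columns named "feature=value".
    """
    # First pass: collect every column name that will be needed.
    columns = set()
    for rec in records:
        for name, value in rec.items():
            columns.add(f"{name}={value}" if isinstance(value, str) else name)
    columns = sorted(columns)
    index = {col: i for i, col in enumerate(columns)}

    # Second pass: fill one row per record.
    rows = []
    for rec in records:
        row = [0.0] * len(columns)
        for name, value in rec.items():
            if isinstance(value, str):
                row[index[f"{name}={value}"]] = 1.0
            else:
                row[index[name]] = float(value)
        rows.append(row)
    return columns, rows


# Hypothetical sample passengers, not taken from the dataset.
sample = [
    {"pclass": "1st", "age": 29.0, "sex": "female"},
    {"pclass": "3rd", "age": 40.0, "sex": "male"},
]
columns, rows = one_hot_records(sample)
print(columns)
for row in rows:
    print(row)
```

The age column keeps its numeric value, and each category value of pclass and sex that appears in the input gets its own 0/1 column.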
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

'''
Decision tree
Suited to problems with multiple features and no obvious linear relationship
Inference logic is intuitive
No need to normalize the data
'''

'''
1 Prepare data
'''
# Read the Titanic passenger data, already downloaded from the internet to a local file
titanic = pd.read_csv("./data/titanic/titanic.txt")
# Inspecting the data shows that some values are missing
# print(titanic.head())

# Extract key features: sex, age and pclass are likely to affect survival
# .copy() so that the fill below modifies x itself rather than a slice of titanic
x = titanic[['pclass', 'age', 'sex']].copy()
y = titanic['survived']
# View the currently selected features
# print(x.info())
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass    1313 non-null object
age       633 non-null float64
sex       1313 non-null object
dtypes: float64(1), object(2)
memory usage: 30.9+ KB
None
'''
# Only 633 of the 1313 rows have an age. Fill the gaps with the mean (the median
# is another option), hoping this has little impact on the model
x['age'] = x['age'].fillna(x['age'].mean())

'''
2 Data split
'''
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=33)
# Use a feature transformer for feature extraction
vec = DictVectorizer()
# Categorical features are expanded into one column per value; numeric features are left unchanged
x_train = vec.fit_transform(x_train.to_dict(orient="records"))
# print(vec.feature_names_)  # ['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male']
x_test = vec.transform(x_test.to_dict(orient="records"))

'''
3 Train the model for prediction
'''
# Initialize the decision tree classifier
dtc = DecisionTreeClassifier()
# Train
dtc.fit(x_train, y_train)
# Predict and save the result
y_predict = dtc.predict(x_test)

'''
4 Model evaluation
'''
print("Accuracy:", dtc.score(x_test, y_test))
# classification_report expects the true labels first, then the predictions
print("Other metrics:\n", classification_report(y_test, y_predict, target_names=['died', 'survived']))
'''
Output recorded from the original run, which called classification_report(y_predict, y_test, ...).
Relative to the corrected call above, its precision and recall columns are swapped
and its support column counts predicted rather than true labels.

Accuracy: 0.7811550151975684
Other metrics:
              precision    recall  f1-score   support

       died       0.91      0.78      0.84       236
   survived       0.58      0.80      0.67        93

avg / total       0.81      0.78      0.79       329
'''
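When reading the report, it helps to remember that precision and recall are not symmetric in the true and predicted labels, and that scikit-learn's classification_report expects the true labels first and the predictions second. The sketch below recomputes both metrics by hand on made-up labels to show that swapping the two arguments swaps precision and recall. The helper precision_recall and the toy labels are hypothetical and serve only as an illustration.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, computed by hand."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 1 = survived, 0 = died.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

correct_order = precision_recall(y_true, y_pred, positive=1)
swapped_order = precision_recall(y_pred, y_true, positive=1)
print("true labels first:", correct_order)
print("predictions first:", swapped_order)
```

With the arguments swapped, false positives and false negatives trade roles, so the two numbers change places while accuracy stays the same.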