The Road to Machine Learning: Python Decision Tree Classification to Predict Survival of Titanic Passengers

 

 

In this post I learn the API of the decision tree classifier in Python 3, using scikit-learn.
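
To get a feel for the API before touching the Titanic data, here is a minimal sketch on a tiny made-up dataset. The rows and the two feature columns (age, is_female) are my own assumptions for illustration, not values from the real dataset; only fit, predict, and score are the actual scikit-learn calls.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data, invented for illustration: [age, is_female] -> survived (1) or died (0)
X = [[22, 0], [38, 1], [26, 1], [35, 0], [4, 1], [54, 0]]
y = [0, 1, 1, 0, 1, 0]

# random_state only fixes tie-breaking so the run is repeatable
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Predict for two unseen passengers, then measure accuracy on the training data
predictions = clf.predict([[30, 1], [40, 0]])
train_accuracy = clf.score(X, y)
print(predictions, train_accuracy)
```

The same three calls (fit, predict, score) are all that the full Titanic script needs from the classifier.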

It also covers feature extraction: numeric features keep their type and value, while categorical features are expanded into new columns, one per category.
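
A small sketch of what that means in practice, using two hand-written passenger records (the values are invented for illustration). DictVectorizer leaves the numeric age as a single column and turns each string category into its own 0/1 column. I pass sparse=False here only so the result is a plain array that is easy to read; the full script keeps the default sparse output.

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up records in the same shape that DataFrame.to_dict(orient="records") produces
records = [
    {"pclass": "1st", "age": 29.0, "sex": "female"},
    {"pclass": "3rd", "age": 2.0, "sex": "male"},
]

vec = DictVectorizer(sparse=False)
matrix = vec.fit_transform(records)

# Numeric feature names are kept as-is; string features become "name=value" columns
print(vec.feature_names_)
print(matrix)
```

Each row ends up with exactly one 1 among the pclass columns and one among the sex columns, which is why no manual encoding step is needed.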

The dataset has to be downloaded from the Internet; I saved a copy locally.

You can download the code and dataset from my GitHub repository: https://github.com/linyi0604/MachineLearning

 

import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

'''
Decision tree
Suited to data with multiple features and no obvious linear relationship
The inference logic is intuitive
No need to normalize the data
'''

'''
1 Prepare data
'''
# Read the Titanic passenger data, which has already been downloaded to a local file
titanic = pd.read_csv("./data/titanic/titanic.txt")
# Inspecting the data shows that some values are missing
# print(titanic.head())

# Select the key features: sex, age, and pclass are likely to affect survival
# (.copy() gives an independent DataFrame rather than a view of titanic)
x = titanic[['pclass', 'age', 'sex']].copy()
y = titanic['survived']
# Inspect the selected features
# print(x.info())
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass    1313 non-null object
age       633 non-null float64
sex       1313 non-null object
dtypes: float64(1), object(2)
memory usage: 30.9+ KB
None
'''
# Only 633 of the 1313 rows have an age. Fill the gaps with the mean (the median would
# also work), in the hope that this distorts the model as little as possible
x['age'] = x['age'].fillna(x['age'].mean())

'''
2 Data split
'''
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=33)
# Use a feature transformer for feature extraction
vec = DictVectorizer()
# Categorical features are expanded into one column per category; numeric features are kept unchanged
# (orient="records" is the full spelling; recent pandas versions reject the abbreviation "record")
x_train = vec.fit_transform(x_train.to_dict(orient="records"))
# print(vec.feature_names_)  # ['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male']
x_test = vec.transform(x_test.to_dict(orient="records"))

'''
3 Train the model for prediction
'''
# Initialize the decision tree classifier
dtc = DecisionTreeClassifier()
# Train
dtc.fit(x_train, y_train)
# Predict and save the result
y_predict = dtc.predict(x_test)
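
'''
4 Model evaluation
'''
print("Accuracy:", dtc.score(x_test, y_test))
# classification_report expects the true labels first, then the predictions
print("Other metrics:\n",
      classification_report(y_test, y_predict, target_names=['died', 'survived']))
# The output below was recorded from the original run, which passed the arguments as
# (y_predict, y_test): its per-class precision and recall columns are therefore exchanged,
# and "support" counts predicted rather than actual labels. Accuracy is unaffected.
'''
Accuracy: 0.7811550151975684
Other metrics:
              precision    recall  f1-score   support

        died       0.91      0.78      0.84       236
    survived       0.58      0.80      0.67        93

 avg / total       0.81      0.78      0.79       329
'''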
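
Since classification_report takes the true labels first and the predictions second, it is worth seeing what happens when the two are exchanged. The sketch below computes precision and recall by hand on a small made-up label list (these are not the Titanic results). precision_recall is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, counted directly from the label pairs."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 1 = survived, 0 = died
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

correct = precision_recall(y_true, y_pred, positive=1)  # true labels first
swapped = precision_recall(y_pred, y_true, positive=1)  # arguments exchanged
print(correct, swapped)
```

Exchanging the arguments swaps precision and recall for the class, while accuracy stays the same because it only counts matching pairs.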

 
