Python Machine Learning and Practice, From Zero to the Road to Kaggle Competitions, Chapter 5: Predicting the Number of Titanic Survivors with a Decision Tree

Disclaimer: everything belongs to the proletariat of the whole world. Source: https://blog.csdn.net/qq_41776781/article/details/89157762

Preface: This section uses a decision tree (DecisionTreeClassifier) to predict the survivors of the Titanic. The features selected for training are Age, Sex and Pclass, and the prediction label is whether the passenger survived. Decision trees are well suited to data sets where the relationship between features and label may be non-linear. For example, when predicting survival from age, passengers who are very young or very old have a lower survival rate, while middle-aged passengers are more likely to survive, so age and survival probability are not linearly related. In a case like this it is worth trying a decision tree for the classification.
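As a minimal sketch of that point (my own toy example, not part of the original post), the snippet below builds an artificial "age" feature where only the middle range belongs to the positive class; a shallow decision tree can still model this non-monotone relationship.

# Toy sketch (not from the original code): a decision tree handles a
# non-monotone relationship, e.g. only the middle age range is positive.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

ages = np.arange(1, 81).reshape(-1, 1)                      # toy ages 1..80
labels = ((ages > 20) & (ages < 50)).astype(int).ravel()    # positive only for 20 < age < 50

toy_tree = DecisionTreeClassifier(max_depth=2)
toy_tree.fit(ages, labels)
print(toy_tree.predict([[10], [35], [70]]))                 # roughly [0 1 0]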

 

Note: the column names for the features are given in lowercase in the book, but in the data file they are capitalized ('Pclass', 'Age', 'Sex'). It is recommended to change the column names in the code to the capitalized form, otherwise the program below will not find the corresponding columns. The overall idea of the program is listed below, and the relevant functions are explained later alongside the code.
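If you are unsure how the columns are spelled in your copy of the CSV, a quick check like the one below helps avoid the "column not found" problem. This is my own hypothetical helper, assuming the standard Kaggle train.csv layout; the rename call simply does nothing when the lowercase names are not present.

# Hypothetical helper (not in the original code): print the column names and
# rename lowercase variants to match 'Pclass', 'Age', 'Sex', 'Survived'.
import pandas as pd

df = pd.read_csv('../Dataset/Tencent-Datasets/Titanic/train.csv')
print(df.columns.tolist())     # e.g. ['PassengerId', 'Survived', 'Pclass', ...]
df = df.rename(columns={'pclass': 'Pclass', 'age': 'Age',
                        'sex': 'Sex', 'survived': 'Survived'})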

1. Load the data set and split it into a training set and a test set.

2. Preprocess the data with DictVectorizer, including filling in the missing values (a small standalone DictVectorizer sketch follows this list).

3. Create a DecisionTreeClassifier object, fit it on the training data, and use the fitted object to predict and evaluate on the test set.
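The standalone DictVectorizer sketch mentioned in step 2 (my own illustration, not the original code) shows what the vectorizer does with this mixed feature set: numeric columns such as Age and Pclass pass through unchanged, while the string column Sex is expanded into one-hot columns Sex=female and Sex=male.

# Minimal DictVectorizer sketch: numeric values pass through,
# string values are expanded into one-hot columns.
from sklearn.feature_extraction import DictVectorizer

rows = [{'Pclass': 3, 'Age': 22.0, 'Sex': 'male'},
        {'Pclass': 1, 'Age': 38.0, 'Sex': 'female'}]

demo_vec = DictVectorizer(sparse=False)
print(demo_vec.fit_transform(rows))   # 2 x 4 numeric matrix
print(demo_vec.feature_names_)        # ['Age', 'Pclass', 'Sex=female', 'Sex=male']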

 

The code is as follows:

# -*- coding: utf-8 -*-
# @Time    : 2019/4/8 8:47
# @Author  : YYLin
# @Email   : [email protected]
# @File    : Five-Program-DecisionTreeClassifier-Titanic.py
# Load the data set
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

titanic = pd.read_csv('../Dataset/Tencent-Datasets/Titanic/train.csv')
print("First 5 rows of the data set:", titanic.head())
# Check how many non-null values each column contains
titanic.info()

# Select the feature columns and the label column
X = titanic[['Pclass', 'Age', 'Sex']].copy()
y = titanic['Survived']
X.info()

# Fill the missing ages with the mean age
X['Age'] = X['Age'].fillna(X['Age'].mean())
X.info()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

# Vectorize the samples row by row as dicts: numeric features pass through,
# the categorical 'Sex' feature is one-hot encoded
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))

# Fit a decision tree classifier on the training data
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

# Print the accuracy and the classification report on the test set
y_predict = dtc.predict(X_test)
print(dtc.score(X_test, y_test))
print(classification_report(y_test, y_predict, target_names=['died', 'survived']))

 

The results: running the program prints the accuracy of the decision tree on the test set, followed by the classification report with precision, recall and F1-score for the 'died' and 'survived' classes.
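Beyond the report, one way to see what the model actually learned (my own addition, not from the original post) is to pair the column names stored by the DictVectorizer with the feature importances exposed by the fitted tree.

# Optional inspection (not in the original post): which encoded columns
# the tree relied on most. Run after dtc.fit(...) above.
for name, importance in zip(vec.feature_names_, dtc.feature_importances_):
    print(name, round(importance, 3))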

 
