Chapter 7: DecisionTree Titanic Survivor Prediction

Data Analysis

The Titanic dataset (a table of 891 rows and 12 columns; the CSV file has 892 lines counting the header)
[Figure: preview of the raw Titanic table]
PassengerId and Name are obviously unrelated to survival, so they are dropped (the code below drops Ticket for the same reason, and PassengerId becomes the row index).

Cabin: the cabin number. It has some relationship to survival, but most of its values are missing and there is no extra information for grouping cabins, so it is dropped as well.

Embarked: the port where the passenger boarded. The S/C/Q values need to be converted to numeric data.
Sex: gender. Likewise converted to 0/1 numeric data.

Missing values are filled in.

Finally, the Survived column must be extracted as the label data.

Processing the Table with Pandas

A feature with only two values (Sex) can be converted to 0/1 with a boolean expression. A second method, LabelEncoder(), appears further below.

A feature with several distinct values is converted to numeric codes 0, 1, 2, etc. using the following (a toy sketch follows this list):
unique(): Return unique values in the object.
tolist(): Return a list of the values.
apply(): Apply a function element-wise along a Series (or along an axis of a DataFrame).
index(): Return the position of the first matching value in a list.
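
To make these concrete, here is a toy sketch of the whole conversion on a made-up column (hypothetical data, not the actual Titanic columns):

import pandas as pd

s = pd.Series(['S', 'C', 'Q', 'S', 'C'])    # hypothetical port column
labels = s.unique().tolist()                # ['S', 'C', 'Q']
codes = s.apply(lambda n: labels.index(n))  # each value -> its position in labels
print(codes.tolist())                       # [0, 1, 2, 0, 1]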

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def read_dataset(fname):

    # use the first column (PassengerId) as the row index
    data = pd.read_csv(fname, index_col=0) 
    
    # drop columns that are not useful
    data.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
    
    # encode gender: male = 1, female = 0
    data['Sex'] = (data['Sex'] == 'male').astype('int')
    
    # encode the embarkation port as integer codes
    labels = data['Embarked'].unique().tolist()
    data['Embarked'] = data['Embarked'].apply(lambda n: labels.index(n))
    
    # fill missing values with 0
    data = data.fillna(0)
    return data

train = read_dataset(r'C:\Users\Qiuyi\Desktop\titanic\train.csv')

train.head()

[Figure: train.head() output]

Extract the Survived labels, then split into training and test sets:

from sklearn.model_selection import train_test_split

y = train['Survived'].values
X = train.drop(['Survived'], axis=1).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print('train dataset: {0}; test dataset: {1}'.format(
    X_train.shape, X_test.shape))

train dataset: (712, 7); test dataset: (179, 7)

Training a Decision Tree

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print('train score: {0}; test score: {1}'.format(train_score, test_score))

train score: 0.9859550561797753; test score: 0.7932960893854749

Overfitting!
The remedy is pruning. scikit-learn did not support post-pruning at the time of writing (newer versions offer cost-complexity pruning via ccp_alpha), so we pre-prune with parameters such as max_depth: once the tree reaches the depth limit it stops splitting, which curbs overfitting to some extent.

DRY (Don't Repeat Yourself): instead of trying parameter values one at a time, build a range of candidates, compute a model score for each, and pick the parameter whose model scores highest.

np.argmax()
Returns the indices of the maximum values along an axis.

def cv_score(d):
    # train a tree pre-pruned at depth d, then score it on both sets
    clf = DecisionTreeClassifier(max_depth=d)
    clf.fit(X_train, y_train)
    tr_score = clf.score(X_train, y_train)
    cv_score = clf.score(X_test, y_test)  # really a hold-out score, not true CV
    return (tr_score, cv_score)

depths = range(2, 15)
scores = [cv_score(d) for d in depths]
tr_scores = [s[0] for s in scores]
cv_scores = [s[1] for s in scores]

best_score_index = np.argmax(cv_scores)
best_score = cv_scores[best_score_index]
best_param = depths[best_score_index]
print('best param: {0}; best score: {1}'.format(best_param, best_score))

plt.figure(figsize=(10, 6), dpi=144)
plt.grid()
plt.xlabel('max depth of decision tree')
plt.ylabel('score')
plt.plot(depths, cv_scores, '.g-', label='cross-validation score')
plt.plot(depths, tr_scores, '.r--', label='training score')
plt.legend()

best param: 5; best score: 0.8435754189944135


[Figure: training vs. cross-validation score as max_depth varies]

Use the same approach to examine the impurity threshold (the code below uses min_impurity_decrease; the older min_impurity_split parameter is deprecated). It sets a threshold on the entropy or Gini impurity decrease: if a split would reduce the impurity by less than this value, the node is not split.

# train the model and compute its scores
def cv_score(val):
    clf = DecisionTreeClassifier(criterion='gini', min_impurity_decrease=val)
    # changing criterion='gini' to 'entropy' gives similar results
    clf.fit(X_train, y_train)
    tr_score = clf.score(X_train, y_train)
    cv_score = clf.score(X_test, y_test)
    return (tr_score, cv_score)

# specify the parameter range, train a model for each value, and compute its score
values = np.linspace(0, 0.005, 50)
scores = [cv_score(v) for v in values]
tr_scores = [s[0] for s in scores]
cv_scores = [s[1] for s in scores]

# find the parameter of the highest-scoring model
best_score_index = np.argmax(cv_scores)
best_score = cv_scores[best_score_index]
best_param = values[best_score_index]
print('best param: {0}; best score: {1}'.format(best_param, best_score))

# plot parameter value against model score
plt.figure(figsize=(10, 6), dpi=144)
plt.grid()
plt.xlabel('threshold of impurity decrease')
plt.ylabel('score')
plt.plot(values, cv_scores, '.g-', label='cross-validation score')
plt.plot(values, tr_scores, '.r--', label='training score')
plt.legend()

best param: 0.0012244897959183673; best score: 0.8212290502793296


[Figure: training vs. cross-validation score as the impurity threshold varies]

sklearn.model_selection.GridSearchCV:
Exhaustive search over specified parameter values for an estimator

def plot_curve(train_sizes, cv_results, xlabel):
    train_scores_mean = cv_results['mean_train_score']
    train_scores_std = cv_results['std_train_score']
    test_scores_mean = cv_results['mean_test_score']
    test_scores_std = cv_results['std_test_score']
    plt.figure(figsize=(10, 6), dpi=144)
    plt.title('parameters tuning')
    plt.grid()
    plt.xlabel(xlabel)
    plt.ylabel('score')
    plt.fill_between(train_sizes, 
                     train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, 
                     alpha=0.1, color="r")
    plt.fill_between(train_sizes, 
                     test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, 
                     alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, '.--', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, '.-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")

from sklearn.model_selection import GridSearchCV

thresholds = np.linspace(0, 0.005, 50)
# Set the parameters by cross-validation
param_grid = {'min_impurity_decrease': thresholds}

clf = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, return_train_score=True)
clf.fit(X, y)
print("best param: {0}\nbest score: {1}".format(clf.best_params_, 
                                                clf.best_score_))

plot_curve(thresholds, clf.cv_results_, xlabel='gini thresholds')

best param: {'min_impurity_decrease': 0.0012244897959183673}

best score: 0.8114478114478114

[Figure: GridSearchCV train/test score bands over the thresholds]

Now combine the parameters: search max_depth, min_impurity_decrease (under both criteria), and min_samples_split in a single grid search to find the best settings. Note that param_grid below is a list of dicts, so GridSearchCV searches each sub-grid separately rather than the full cross-product.

sklearn.tree.DecisionTreeClassifier

min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:

from sklearn.model_selection import GridSearchCV

entropy_thresholds = np.linspace(0, 0.01, 50)
gini_thresholds = np.linspace(0, 0.005, 50)

# Set the parameters by cross-validation
param_grid = [{'max_depth': range(2, 10)},
              {'criterion': ['entropy'], 'min_impurity_decrease': entropy_thresholds},
              {'criterion': ['gini'], 'min_impurity_decrease': gini_thresholds},
              {'min_samples_split': range(2, 30, 2)}]

clf = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, return_train_score=True)
clf.fit(X, y)
print("best param: {0}\nbest score: {1}".format(clf.best_params_, 
                                                clf.best_score_))

best param: {'criterion': 'entropy', 'min_impurity_decrease': 0.002857142857142857}

best score: 0.8226711560044894

Why do max_depth and min_samples_split not show up in the best parameters? Because param_grid is a list of sub-grids, GridSearchCV evaluates each one independently, and best_params_ reports only the single best candidate overall; here that candidate came from the entropy/min_impurity_decrease sub-grid, so the other parameters stay at their defaults.
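
To check this, cv_results_ records every candidate that was tried. A sketch (best_in_subgrid is a hypothetical helper, not a scikit-learn API; clf is the fitted GridSearchCV from above):

import numpy as np

def best_in_subgrid(cv_results, key):
    # candidates whose parameter dict contains `key`
    idx = [i for i, p in enumerate(cv_results['params']) if key in p]
    scores = [cv_results['mean_test_score'][i] for i in idx]
    best = idx[int(np.argmax(scores))]
    return cv_results['params'][best], cv_results['mean_test_score'][best]

for key in ('max_depth', 'min_impurity_decrease', 'min_samples_split'):
    print(key, best_in_subgrid(clf.cv_results_, key))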

clf = DecisionTreeClassifier(criterion='entropy', min_impurity_decrease=0.002857142857142857)
clf.fit(X_train, y_train)
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print('train score: {0}; test score: {1}'.format(train_score, test_score))

train score: 0.8890449438202247; test score: 0.8100558659217877
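
As a quick sanity check before visualizing, feature_importances_ shows how much each feature contributed to the tree's impurity reduction. The column order below is the one produced by read_dataset above:

feature_names = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
for name, importance in zip(feature_names, clf.feature_importances_):
    print('{0}: {1:.3f}'.format(name, importance))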

Export a titanic.dot file:

from sklearn.tree import export_graphviz

with open("titanic.dot", 'w') as f:
    f = export_graphviz(clf, out_file=f)

# 1. Install Graphviz on your machine
# 2. Run `dot -Tpng titanic.dot -o titanic.png`
# 3. Find the generated tree image titanic.png in the current directory

About the failure when running dot -Tpng titanic.dot -o titanic.png:

>>> cd C:\Users\Qiuyi\Desktop\titanic
>>> python3
>>> import graphviz
>>> dot -Tpng titanic.dot -o titanic.png

SyntaxError: invalid syntax
'dot' is not recognized as an internal or external command, operable program or batch file.

(The SyntaxError comes from typing the shell command inside the Python REPL; the second message appears in a shell where dot is not on the PATH.)

Solution:

First, Graphviz itself must be installed. This does not mean pip3 install graphviz: download the graphviz-2.38.msi installer, install it on the machine, and add the bin folder inside the installation directory to PATH:

Open Control Panel → System and Security → System → Advanced system settings → Advanced → Environment Variables → find Path under System variables → Edit → New → paste the bin folder path → save.
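
Alternatively, newer scikit-learn versions (0.21+) can draw the tree directly with matplotlib via sklearn.tree.plot_tree, which sidesteps the Graphviz installation entirely. A sketch assuming the fitted clf from above:

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

feature_names = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
plt.figure(figsize=(16, 8), dpi=144)
plot_tree(clf, feature_names=feature_names,
          class_names=['died', 'survived'], filled=True)
plt.savefig('titanic_tree.png')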

Analyzing the decision tree:

Reference (from Zhihu): Python决策树模型做 Titanic数据集预测并可视化
[Figure: the rendered decision tree]

The root node's test is sex < 0.5: since sex takes only the values 0 and 1 (female and male respectively), the True branch means female and the False branch means male.

The two numbers in each node's value field are the number of deaths (left) and survivors (right) among that node's samples. Reading the tree with this in mind, the first split on gender yields 253 women (per value: 62 died, 191 survived) and 459 men (371 died, 88 survived); by and large, the men on the Titanic put the women first when it came to life and death. The total of 712 is the size of the training split taken from train.csv.
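
These node counts can be cross-checked against the training split with a groupby (a sketch; the exact numbers vary with each random train_test_split):

import pandas as pd

cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
df_tr = pd.DataFrame(X_train, columns=cols)
df_tr['Survived'] = y_train
# per gender (0 = female, 1 = male): sample count and number of survivors
print(df_tr.groupby('Sex')['Survived'].agg(['count', 'sum']))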

LabelEncoder()

from sklearn.preprocessing import LabelEncoder

df = pd.read_csv(r'C:\Users\Qiuyi\Desktop\titanic\train.csv')
le = LabelEncoder()
le.fit(df['Sex'])
# replace the string labels with their integer codes
df['Sex'] = le.transform(df['Sex'])
print(df.Sex)

0      1
1      0
2      0
      ..
889    1
890    1
Name: Sex, Length: 891, dtype: int64
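
LabelEncoder also records the mapping and can reverse it:

print(le.classes_)                   # ['female' 'male'], so female=0, male=1
print(le.inverse_transform([0, 1]))  # back to the original string labels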

Predicting the Test Set and Submitting for Scoring

# note: this reuses the name X_test, replacing the hold-out split from earlier
X_test = read_dataset(r'C:\Users\Qiuyi\Desktop\titanic\test.csv')
y_predict = clf.predict(X_test)
X_test.head()

[Figure: X_test.head() output]

# prepend the predictions as a Survived column
X_test.insert(0, 'Survived', y_predict)

# keep only Survived; since PassengerId is the index and index=True,
# the file gets the two columns Kaggle expects: PassengerId, Survived
final = X_test.iloc[:, 0:1]
final.to_csv(r'C:\Users\Qiuyi\Desktop\titanic\PredictTitanic.csv', index=True)


Submitting this directly scores only 0.73205, ha ha.



Reposted from blog.csdn.net/weixin_34275246/article/details/85195348