Level 15: Predicting the Survival of Titanic Passengers (Artificial Intelligence Course, Little Elephant Academy)



 

Overview of this level

 

Welcome to this level. We first introduce how to use decision trees in sklearn and how several important parameters affect the model; then we use a decision tree model to predict the survival of Titanic passengers. Are you ready? Let's go!

Let's take a brief look at the decision tree model in sklearn's official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html.

As you can see, the DecisionTreeClassifier class lives in the sklearn.tree package and has many hyperparameters that can be adjusted. Next, we introduce the use of the decision tree model in sklearn through a simple example and show how to tune several typical parameters.
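If you want to list these hyperparameters and their current values from code rather than from the documentation page, every sklearn estimator exposes a get_params() method; a quick check might look like this:

# Quick check: list the hyperparameters of a DecisionTreeClassifier and their default values
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
print(clf.get_params())   # e.g. {'criterion': 'gini', 'max_depth': None, ...}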

 

Decision tree in sklearn

 

We use sklearn's own functions to generate some simulated data. For the sake of intuition, we display the data as a scatter plot:

AI_15_0_1_Use the make_moons function to generate some simulated data

# Use the make_moons function to generate some simulated data
from sklearn.datasets import make_moons
X, y = make_moons(noise=0.25, random_state=666)

%matplotlib inline
import matplotlib.pyplot as plt

plt.scatter(X[y==0, 0], X[y==0, 1])
plt.scatter(X[y==1, 0], X[y==1, 1])
plt.show();
y

array([1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0,
       0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0])

Next, we will use the decision tree algorithm to classify this data set, and observe the final classification effect by adjusting different parameters.

AI_15_0_2

from sys import path
path.append(r"../data/course_util")
from ai_course_15_1 import *

# Import the decision tree classifier
from sklearn.tree import DecisionTreeClassifier

# Instantiate a decision tree classifier with default parameters
dt_clf = DecisionTreeClassifier()

# Train on the data
dt_clf.fit(X, y)

 

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In the program above, we instantiated a decision tree classifier without passing any parameters, which means the model is trained with all of the default parameter values.

 

To visualize the decision boundary learned by the model, we have prepared a plot_decision_boundary function in advance, which you can call directly:

AI_15_0_3

from sys import path
path.append(r"../data/course_util")
from ai_course_15_2 import *
%matplotlib inline

# Pass in the trained model dt_clf; the axis parameter controls the x- and y-axis ranges
plot_decision_boundary(dt_clf, axis=[-1.5, 2.5, -1.0, 2.5])

plt.scatter(X[y==0, 0], X[y==0, 1])
plt.scatter(X[y==1, 0], X[y==1, 1])
plt.show();

Let's analyze the decision boundary produced by the program above:

  • First, we get an irregular decision boundary, and every dividing line is either horizontal or vertical (axis-parallel), which is a known shortcoming of decision tree models;
  • Second, the fit is not ideal: the boundary is overly complex and bends around individual noisy points, a sign of overfitting.
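By the way, the plot_decision_boundary helper comes from the course files, so its exact code is not shown here. The idea behind such a helper is simple: evaluate the trained model on a dense grid of points covering the given axis range and color each region by the predicted class. A minimal sketch under that assumption (the function body below is our own illustration, not the course's code) could be:

# Sketch of how a plot_decision_boundary-style helper can be implemented
# (assumption: this mirrors the course helper's behavior, not its exact code)
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary_sketch(model, axis):
    # build a dense grid covering axis = [x_min, x_max, y_min, y_max]
    x0, x1 = np.meshgrid(np.linspace(axis[0], axis[1], 300),
                         np.linspace(axis[2], axis[3], 300))
    grid = np.c_[x0.ravel(), x1.ravel()]
    # predict a class for every grid point and color the regions by class
    y_pred = model.predict(grid).reshape(x0.shape)
    plt.contourf(x0, x1, y_pred, alpha=0.3)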

 

max_depth parameter

 

Let's first adjust the max_depth parameter, which controls the maximum depth of the decision tree.

AI_15_0_4

from sys import path
path.append(r"../data/course_util")
from ai_course_15_3 import *
%matplotlib inline

# Instantiate a decision tree classifier with max_depth=2
dt_clf2 = DecisionTreeClassifier(max_depth=2)
dt_clf2.fit(X, y)

plot_decision_boundary(dt_clf2, axis=[-1.5, 2.5, -1.0, 2.5])
plt.scatter(X[y==0, 0], X[y==0, 1])
plt.scatter(X[y==1, 0], X[y==1, 1])
plt.show();

Let's analyze the output of the program:

  • Compared with the boundary obtained with the default parameters, the earlier boundary is more complicated and reaches higher accuracy on the training data, but the model overfits the data; in other words, when the model is applied to new data, its performance is not necessarily good;
  • The boundary obtained with max_depth=2 is much simpler, but the number of misclassified training samples increases noticeably (see the quick check below).
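To put numbers behind this comparison, we can score both models on the training data (a quick check, using the dt_clf and dt_clf2 trained above):

# Training accuracy of the two models
print(dt_clf.score(X, y))    # default tree: essentially memorizes the training set
print(dt_clf2.score(X, y))   # max_depth=2: lower training accuracy, but a simpler model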

 

min_samples_split parameter

 

Let's look at another parameter, min_samples_split: a node of the decision tree must contain at least min_samples_split samples before it can be split further.

AI_15_0_5

from sys import path
path.append(r"../data/course_util")
from ai_course_15_4 import *
%matplotlib inline

# Instantiate a decision tree classifier with min_samples_split=10
dt_clf3 = DecisionTreeClassifier(min_samples_split=10)
dt_clf3.fit(X, y)

plot_decision_boundary(dt_clf3, axis=[-1.5, 2.5, -1.0, 2.5])
plt.scatter(X[y==0, 0], X[y==0, 1])
plt.scatter(X[y==1, 0], X[y==1, 1])
plt.show();

Judging from the output results, this decision boundary is between the two classification boundaries obtained previously. It neither overfits the data nor divides the data too simply. It is an ideal decision boundary.

 

sklearn also provides us with many adjustable parameters, such as:

  • min_samples_leaf
  • max_leaf_nodes
  • min_weight_fraction_leaf
  • max_features

 

We will not demonstrate them one by one here. In practical applications, we can use techniques such as grid search to select an optimal set of hyperparameters and obtain a better decision tree; a quick sketch of how such parameters are passed follows.
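Just to illustrate how these parameters are passed to the classifier, here is a quick sketch; the particular values are arbitrary and chosen only for demonstration:

# Sketch: combining several regularization parameters (values are illustrative only)
dt_clf4 = DecisionTreeClassifier(min_samples_leaf=5,      # every leaf must contain at least 5 samples
                                 max_leaf_nodes=8,        # grow at most 8 leaf nodes
                                 min_weight_fraction_leaf=0.0)
dt_clf4.fit(X, y)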

 

Case: Predicting the survival of passengers

 

Next, we will use the decision tree algorithm to solve a practical problem: predict the survival of Titanic passengers.

 

In 1912, during its first voyage, the Titanic accidentally hit an iceberg and sank, causing a historic tragedy. In the subsequent accident investigation, the information of the passengers on the ship was gradually disclosed. Our task is to analyze the data and try to find out the survival logic of the passengers in the accident.

 

Get data and field meaning

 

Let's take a look at the passenger data first. The data can be found at https://chinahadoop-xc.oss-cn-beijing.aliyuncs.com/titanic.txt and can be read directly with the read_csv() function from the pandas package. The procedure is as follows:

AI_15_0_6_Import pandas package

# Import the pandas package
import pandas as pd

# Read the passenger data directly from the Internet
titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')

# Look at the first 5 rows of the passenger data
titanic.head(5)
  row.names pclass survived name age embarked home.dest room ticket boat sex
0 1 1st 1 Allen, Miss Elisabeth Walton 29.0000 Southampton St Louis, MO B-5 24160 L221 2 female
1 2 1st 0 Allison, Miss Helen Loraine 2.0000 Southampton Montreal, PQ / Chesterville, ON C26 NaN NaN female
2 3 1st 0 Allison, Mr Hudson Joshua Creighton 30.0000 Southampton Montreal, PQ / Chesterville, ON C26 NaN (135) male
3 4 1st 0 Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 25.0000 Southampton Montreal, PQ / Chesterville, ON C26 NaN NaN female
4 5 1st 1 Allison, Master Hudson Trevor 0.9167 Southampton Montreal, PQ / Chesterville, ON C22 NaN 11 male

As you can see, the passenger data we get is a table: each row represents one passenger's information and each column represents a field. In pandas, this kind of table is called a DataFrame.

 

Compared with numpy's ndarray, a pandas DataFrame adds a row index to every row and a column name to every column in addition to the data itself. This design is very convenient to use: for example, we can retrieve the data in the age column directly with titanic['age'].

 

Furthermore, pandas is built on top of numpy, so a pandas DataFrame can easily be converted to a numpy ndarray; for example, titanic.values gives a numpy ndarray. From the previous levels we know that the machine learning library sklearn works with ndarray-type data, but if we pass in DataFrame-type data, sklearn will automatically convert it to an ndarray, so we don't need to worry about the data type at all.
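A quick illustration of both points, assuming titanic is the DataFrame loaded above:

# Retrieve a single column by name (returns a pandas Series)
print(titanic['age'].head())

# Convert the whole DataFrame to a numpy ndarray
print(type(titanic.values))   # <class 'numpy.ndarray'>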

 

pandas provides many practical built-in methods, which are very suitable for completing the task of data analysis. Next, let us use the case of Titanic to feel the charm of data analysis using pandas!

 

Let's first introduce the meaning of each field in the read data:

  • row.names: the number of the passenger, which can be ignored;

  • pclass : the class of the cabin, 1st means first class, 2nd means second class, and 3rd means third class;

  • survived : whether to survive, 1 means yes, 0 means no;

  • name: the name of the passenger;

  • age : the age of the passenger; infants' ages are expressed as decimals (fractions of a year);

  • embarked: the port of embarkation;

  • home.dest: origin and destination;

  • room: room number;

  • ticket: ticket number;

  • boat: lifeboat number;

  • sex : the gender of the passenger

 

Among the above fields, survived is the result we want to predict, so we use it as the label; the remaining fields can be regarded as features.

 

Now that we have the data, can we use it directly to train our model?

 

Don't worry. Before we "feed" the data to the model, we must first make sure it is the "good" data the model needs. To achieve this, we need to perform at least three operations on the data: feature selection, data cleaning, and data preprocessing.

 

Feature selection

 

Feature selection, as the name suggests, is to remove unimportant features and keep important features.

For this case, based on our understanding of the accident, we simply keep the three features pclass, age, and sex; the other features are taken to be irrelevant to the final prediction, so we discard them directly.

AI_15_0_7

from sys import path
path.append(r"../data/course_util")
from ai_course_15_5 import *

# Select the three features pclass, age, sex as the dataset X; use the survived field as the label y
X = titanic[['pclass', 'age', 'sex']]
y = titanic['survived']

# Check the shape of X
print(X.shape)

# Look at the last 5 rows of X
print(X.tail(5))

# Look at the overall summary information of X
X.info()

Let's analyze the three results of the program output:

  • The shape of X is (1313, 3), which means the data set contains 1313 passenger records, each with 3 features;
  • Looking at the last 5 rows of X, we find many NaN (Not a Number) entries in the age feature, i.e. the age feature has many missing values;
  • Looking at the summary from X.info() and focusing on the age feature, we see that only 633 values are non-null, which again shows that age has many missing values that need to be handled in the next step; the snippet below counts them directly.
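For a direct count of those missing values, pandas provides isnull(); a small aside:

# Count the NaN entries in the age column
print(X['age'].isnull().sum())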

 

Data cleaning

 

Data cleaning usually includes the deletion of outliers, the filling of missing values, and the correction of erroneous values.

In this case, we need to fill in the missing values in the age feature. Here we fill them with the average value, which takes two steps:

  • First use the mean() method to calculate the average value of the age feature

  • Then use the fillna() method to replace the NaN value in age with the average value

The procedure is as follows:

AI_15_0_8

from sys import path
path.append(r"../data/course_util")
from ai_course_15_6 import *

# Calculate the mean of the age feature
mean_age = X['age'].mean()
print(mean_age)

# Replace the NaN values with the mean
X['age'] = X['age'].fillna(mean_age)

# Look at the last 5 rows of X
X.tail(5)
31.19418104265403
  pclass age sex
1308 3rd 31.194181 male
1309 3rd 31.194181 male
1310 3rd 31.194181 male
1311 3rd 31.194181 female
1312 3rd 31.194181 male

It can be seen from the output of the program that the NaN value in the age feature has been replaced with the average value.

 

Data preprocessing

 

After the previous processing, the data set already looks much "nicer"! But before we start training the model, think again: what other issues do we still need to deal with?

 

There are at least two issues we need to deal with:

  • The pclass and sex features are string types, which must be converted to numeric types before they can be substituted into the model for calculation. We can use the map() method to achieve the conversion between text and numeric values;

  • The value ranges of the pclass, age, and sex features differ considerably, so we need to normalize these three features to values between 0 and 1. This kind of processing has appeared many times in previous cases, so it should be quite familiar by now.

AI_15_0_9

from sys import path
path.append(r"../data/course_util")
from ai_course_15_7 import *

# Map '1st' to 1, '2nd' to 2, '3rd' to 3
# Map 'female' to 0, 'male' to 1
X['pclass'] = X['pclass'].map({'1st':1, '2nd':2, '3rd':3})
X['sex'] = X['sex'].map({'female':0, 'male':1})

# Look at the last 5 rows of X
print(X.tail(5))

# Normalize all features to between 0 and 1
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled) 
X_scaled
pclass        age  sex
1308     NaN  31.194181  NaN
1309     NaN  31.194181  NaN
1310     NaN  31.194181  NaN
1311     NaN  31.194181  NaN
1312     NaN  31.194181  NaN
[[       nan 0.40705854        nan]
 [       nan 0.02588189        nan]
 [       nan 0.4211762         nan]
 ...
 [       nan 0.43803523        nan]
 [       nan 0.43803523        nan]
 [       nan 0.43803523        nan]]
array([[       nan, 0.40705854,        nan],
       [       nan, 0.02588189,        nan],
       [       nan, 0.4211762 ,        nan],
       ...,
       [       nan, 0.43803523,        nan],
       [       nan, 0.43803523,        nan],
       [       nan, 0.43803523,        nan]])

 

After the above three steps of processing, we finally get the data we want: X_scaled and y. From this we can see that in a machine learning project, data preparation often takes a great deal of time and energy, and its importance is no less than that of training and tuning the model.
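As a side note, sklearn can also chain these preparation steps (imputation, encoding, scaling) into a single object. This is not how the course code is organized; the following is only a sketch of an alternative arrangement, with the names chosen purely for illustration:

# Sketch: organizing the preprocessing as a Pipeline (an alternative, not the course approach)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

preprocess = ColumnTransformer([
    # fill missing ages with the mean, then scale to [0, 1]
    ('age', Pipeline([('impute', SimpleImputer(strategy='mean')),
                      ('scale', MinMaxScaler())]), ['age']),
    # encode the string-valued pclass and sex as integers, then scale to [0, 1]
    ('cat', Pipeline([('encode', OrdinalEncoder()),
                      ('scale', MinMaxScaler())]), ['pclass', 'sex']),
])

# the full model: preprocessing followed by a decision tree
model = Pipeline([('prep', preprocess), ('tree', DecisionTreeClassifier())])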

 

Training model

 

Next, we use the decision tree classifier in the sklearn library for training.

AI_15_0_10

from sys import path
path.append(r"../data/course_util")
from ai_course_15_8 import *

# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=100)

# Import the decision tree classifier from sklearn
from sklearn.tree import DecisionTreeClassifier

# Initialize a decision tree classifier with default parameters
dt_clf = DecisionTreeClassifier()

# Train the decision tree model on the training set
dt_clf.fit(X_train, y_train)

# Calculate the model's prediction accuracy on the test set
dt_clf.score(X_test, y_test)
0.779467680608365
From the program output, the decision tree model trained with the default parameters reaches a prediction accuracy of about 78% on the test set, which looks decent. Now we would like to raise the prediction accuracy above 80%. Can you think of a good way to do that?

 

Tuning

 

We can try building decision tree models with different hyperparameters, which is exactly the grid search method we learned earlier! Below we directly call the GridSearchCV class in sklearn to search for the optimal parameter configuration.

AI_15_0_11

from sys import path
path.append(r"../data/course_util")
from ai_course_15_8 import *

# Initialize a decision tree classifier with default parameters
dt_clf = DecisionTreeClassifier()

# Import the GridSearchCV class from sklearn
from sklearn.model_selection import GridSearchCV

# Configure the parameter names and value ranges to search, stored in a dictionary
params = {'max_depth':[2,3,4,5], 'min_samples_split':[2,4,6,8,10]}

# Pass the classifier and the search space to GridSearchCV; it will find the best parameter combination for us
dt_best = GridSearchCV(dt_clf, params)
dt_best.fit(X_train, y_train)

# Print the best parameter combination found
print(dt_best.best_params_)
{'max_depth': 2, 'min_samples_split': 2}

In the above program, we searched over the two hyperparameters max_depth and min_samples_split: the four values 2, 3, 4, 5 for max_depth and the five values 2, 4, 6, 8, 10 for min_samples_split. The grid search results show that max_depth=2 and min_samples_split=2 form the optimal parameter combination. Let's quickly write the program and try it out:

AI_15_0_12

from sys import path
path.append(r"../data/course_util")
from ai_course_15_8 import *

# Initialize a decision tree classifier with the optimal parameters found by the grid search
dt_clf = DecisionTreeClassifier(max_depth=2, min_samples_split=2)

# Train the decision tree model on the training set
dt_clf.fit(X_train, y_train)

# Calculate the model's prediction accuracy on the test set
dt_clf.score(X_test, y_test)
0.8403041825095057

The decision tree model rebuilt with the parameters found by the grid search reaches a prediction accuracy of about 84% on the test set. The power of grid search should not be underestimated!
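Incidentally, GridSearchCV (with its default refit=True) also keeps a model refit with the best parameters, so the cross-validated score and the test-set score of that refit model can be read off directly, for example:

# Mean cross-validated accuracy of the best parameter combination
print(dt_best.best_score_)

# score() of a fitted GridSearchCV uses the refit best estimator
print(dt_best.score(X_test, y_test))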

 

Use models to mine data

 

Now that we have trained a decision tree model with good results, we will use this model to do some interesting explorations:

 

Predict the survival probability of the hero and heroine

 

In the Titanic movie, the male protagonist Jack and the female protagonist Rose are two fictitious characters. According to the plot of the movie, we assume that:

  • Jack travels in third class, is 23 years old, and is of course male
  • Rose travels in first class, is 20 years old, and is of course female

 

From this, we can construct the data of Jack and Rose as follows:

import numpy as np

jack = np.array([[3, 23, 1]])
rose = np.array([[1, 20, 0]])

Then we can use the trained model to predict the survival probabilities of the hero and heroine:

AI_15_0_13

from sys import path
path.append(r"../data/course_util")
from ai_course_15_9 import *

# Normalize all features to between 0 and 1
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Build the data for the hero and heroine and normalize it
import numpy as np
jack = np.array([[3, 23, 1]])
rose = np.array([[1, 20, 0]])

jack_scaled = scaler.transform(jack)
rose_scaled = scaler.transform(rose)

# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=100)

# Initialize a decision tree classifier with the optimal parameters found earlier
dt_clf = DecisionTreeClassifier(max_depth=2, min_samples_split=2)

# Train the decision tree model on the training set
dt_clf.fit(X_train, y_train)

# Predict the survival probabilities of the hero and heroine
print(dt_clf.predict_proba(jack_scaled)[0][1])
print(dt_clf.predict_proba(rose_scaled))
[[0.11055276 0.88944724]]

From the program output, the survival probability of the male protagonist Jack is only about 15.6%, while the survival probability of the heroine Rose is as high as 88.9%, which is in line with the ending of the movie.

 

Sample analysis of model prediction errors

 

We are curious about the passengers the model gets wrong: why does the model predict that a passenger has a high probability of surviving, yet the passenger did not survive in reality? Is our model not accurate enough, or is there a story hidden behind the data?

 

In order to find out the story behind these cold data, we can do this:

  • First use the model to calculate each passenger's probability of surviving
  • Merge the survival probabilities into the original data as a new column
  • Examine the passengers to whom the model gives a high survival probability but who did not survive in practice

AI_15_0_14

from sys import path
path.append(r"../data/course_util")
from ai_course_15_8 import *

# Initialize a decision tree classifier with the optimal parameters found earlier
dt_clf = DecisionTreeClassifier(max_depth=2, min_samples_split=2)

# Train the decision tree model on the training set
dt_clf.fit(X_train, y_train)

# Compute every passenger's probability of surviving
prob = dt_clf.predict_proba(X_scaled)[:,1]

# Add the survival probabilities as a new column of the original table
titanic['probability'] = prob

# Look at the records of the Allison family
titanic[1:5]
  row.names pclass survived name age embarked home.dest room ticket boat sex probability
1 2 1st 0 Allison, Miss Helen Loraine 2.0000 Southampton Montreal, PQ / Chesterville, ON C26 NaN NaN female 0.889447
2 3 1st 0 Allison, Mr Hudson Joshua Creighton 30.0000 Southampton Montreal, PQ / Chesterville, ON C26 NaN (135) male 0.156342
3 4 1st 0 Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 25.0000 Southampton Montreal, PQ / Chesterville, ON C26 NaN NaN female 0.889447
4 5 1st 1 Allison, Master Hudson Trevor 0.9167 Southampton Montreal, PQ / Chesterville, ON C22 NaN 11 male 1.000000

From the program output, the passengers in the first and third rows shown have a high predicted probability of survival, yet they did not survive in practice. Why?

 

The real situation is this: these 4 records are the information of the Allison family, namely father (30 years old), mother (25 years old), daughter (2 years old) and a baby boy under one year old.

 

Because the rescue followed the principle of "women and children first", the mother could originally have boarded a lifeboat with her daughter and the baby. But the baby suddenly could not be found, so the mother refused to board and stayed on the ship to look for him. Fate, however, loves to play tricks: the baby had already been carried onto a lifeboat by a nurse, who in the chaos forgot to tell his family, and so this sad story came to be.

 

Level summary: in this level, we first introduced how to use the decision tree model in sklearn, and then used a decision tree to predict the survival of Titanic passengers.

 

That's all for today. See you in the next level, bye-bye~

 

