I. Introduction
- Mission objective: given the personal information on the Titanic's passenger list, predict whether each passenger survived
- Dataset: "List of Boarding Personnel on the Titanic", from https://download.csdn.net/download/weixin_43721000/87740848
- Dataset column explanation (a quick way to verify these types is sketched after this list):
Column 1, age: the passenger's age (numeric data)
Column 2, cabin: the cabin number (categorical data: string)
Column 3, embarked: the port of embarkation; S is Southampton, C is Cherbourg (France), Q is Queenstown (Ireland) (categorical data)
Column 4, fare: the ticket price (numeric data)
Column 5, name: the passenger's name (categorical data: string)
Column 6, parch: the number of parents/children aboard, i.e. immediate family members of a different generation; for example, someone sailing with his daughter and his father has parch = parents (1) + children (1) = 2
Column 7, passengerId: the passenger's ID number
Column 8, pclass: the passenger class; 1 is first class, 2 is second class, 3 is third class (categorical data)
Column 9, sex: the passenger's gender, male or female (categorical data)
Column 10, sibsp: the number of siblings/spouses aboard, i.e. immediate family members of the same generation; for example, someone sailing with his brother and his wife has sibsp = siblings (1) + spouses (1) = 2
Column 11, survived: whether the passenger survived; 1 is survival, 0 is death (categorical data)
Column 12, ticket: the ticket number (categorical data: string)
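As a quick sanity check of the column types and missing values, pandas can report each column's dtype and non-null count; a minimal sketch, assuming the Kaggle copy of the same data that is used throughout this article:
import pandas as pd
# Summarize column dtypes and non-null counts of the training split
pd.read_csv('/kaggle/input/titanic/train.csv').info()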
II. Implementation method
1. Read the dataset
import numpy as np
import pandas as pd
# Read the Kaggle training and test sets
dataset = pd.read_csv('/kaggle/input/titanic/train.csv')
X_test = pd.read_csv('/kaggle/input/titanic/test.csv')
print(dataset)
# PassengerId Survived Pclass \
# 0 1 0 3
# 1 2 1 1
# 2 3 1 3
# 3 4 1 1
# 4 5 0 3
# .. ... ... ...
# 886 887 0 2
# 887 888 1 1
# 888 889 0 3
# 889 890 1 1
# 890 891 0 3
#
# Name Sex Age SibSp \
# 0 Braund, Mr. Owen Harris male 22.0 1
# 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
# 2 Heikkinen, Miss. Laina female 26.0 0
# 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
# 4 Allen, Mr. William Henry male 35.0 0
# .. ... ... ... ...
# 886 Montvila, Rev. Juozas male 27.0 0
# 887 Graham, Miss. Margaret Edith female 19.0 0
# 888 Johnston, Miss. Catherine Helen "Carrie" female NaN 1
# 889 Behr, Mr. Karl Howell male 26.0 0
# 890 Dooley, Mr. Patrick male 32.0 0
#
# Parch Ticket Fare Cabin Embarked
# 0 0 A/5 21171 7.2500 NaN S
# 1 0 PC 17599 71.2833 C85 C
# 2 0 STON/O2. 3101282 7.9250 NaN S
# 3 0 113803 53.1000 C123 S
# 4 0 373450 8.0500 NaN S
# .. ... ... ... ... ...
# 886 0 211536 13.0000 NaN S
# 887 0 112053 30.0000 B42 S
# 888 2 W./C. 6607 23.4500 NaN S
# 889 0 111369 30.0000 C148 C
# 890 0 370376 7.7500 NaN Q
#
# [891 rows x 12 columns]
print(X_test)
# PassengerId Pclass Name \
# 0 892 3 Kelly, Mr. James
# 1 893 3 Wilkes, Mrs. James (Ellen Needs)
# 2 894 2 Myles, Mr. Thomas Francis
# 3 895 3 Wirz, Mr. Albert
# 4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist)
# .. ... ... ...
# 413 1305 3 Spector, Mr. Woolf
# 414 1306 1 Oliva y Ocana, Dona. Fermina
# 415 1307 3 Saether, Mr. Simon Sivertsen
# 416 1308 3 Ware, Mr. Frederick
# 417 1309 3 Peter, Master. Michael J
#
# Sex Age SibSp Parch Ticket Fare Cabin Embarked
# 0 male 34.5 0 0 330911 7.8292 NaN Q
# 1 female 47.0 1 0 363272 7.0000 NaN S
# 2 male 62.0 0 0 240276 9.6875 NaN Q
# 3 male 27.0 0 0 315154 8.6625 NaN S
# 4 female 22.0 1 1 3101298 12.2875 NaN S
# .. ... ... ... ... ... ... ... ...
# 413 male NaN 0 0 A.5. 3236 8.0500 NaN S
# 414 female 39.0 0 0 PC 17758 108.9000 C105 C
# 415 male 38.5 0 0 SOTON/O.Q. 3101262 7.2500 NaN S
# 416 male NaN 0 0 359309 8.0500 NaN S
# 417 male NaN 1 1 2668 22.3583 NaN C
#
# [418 rows x 11 columns]
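Both frames contain NaN values (visible above in Age and Cabin). An optional way to see exactly what the cleaning step below has to handle is to count the missing entries per column:
# Count missing values per column in the training set
print(dataset.isnull().sum())
# for the standard Kaggle train.csv: Age 177, Cabin 687, Embarked 2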
2. Data cleaning
- Cleaning titles
# Extract the title (Mr, Mrs, ...) from the Name field; infrequent titles will
# then be merged into a single 'Rare' category to reduce noise
dataset_title = [i.split(',')[1].split('.')[0].strip() for i in dataset['Name']]  # extract the title
dataset['Title'] = pd.Series(dataset_title)  # add it as a new column
print(dataset['Title'].value_counts())
# Mr 517
# Miss 182
# Mrs 125
# Master 40
# Dr 7
# Rev 6
# Mlle 2
# Major 2
# Col 2
# the Countess 1
# Capt 1
# Ms 1
# Sir 1
# Lady 1
# Mme 1
# Don 1
# Jonkheer 1
# Name: Title, dtype: int64
# Merge the infrequent titles into 'Rare'
dataset['Title'] = dataset['Title'].replace(['Lady', 'the Countess', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona', 'Ms', 'Mme', 'Mlle'], 'Rare')
print(dataset['Title'].value_counts())
# Mr 517
# Miss 182
# Mrs 125
# Master 40
# Rare 27
# Name: Title, dtype: int64
# Apply the same extraction and merging to the test set
dataset_title = [i.split(',')[1].split('.')[0].strip() for i in X_test['Name']]
X_test['Title'] = pd.Series(dataset_title)
print(X_test['Title'].value_counts())
# Mr 240
# Miss 78
# Mrs 72
# Master 21
# Col 2
# Rev 2
# Ms 1
# Dr 1
# Dona 1
# Name: Title, dtype: int64
X_test['Title'] = X_test['Title'].replace(['Lady', 'the Countess', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona', 'Ms', 'Mme', 'Mlle'], 'Rare')
print(X_test['Title'].value_counts())
# Mr 240
# Miss 78
# Mrs 72
# Master 21
# Rare 7
# Name: Title, dtype: int64
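The title is worth extracting because it encodes sex, age group and social status in a single column. As an optional check (not part of the original pipeline), the survival rate per title on the training set can be inspected directly:
# Mean survival rate per title (0 = died, 1 = survived)
print(dataset.groupby('Title')['Survived'].mean())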
- Cleaning family size
# Derive the family size aboard from SibSp (same-generation relatives: siblings/spouse)
# and Parch (cross-generation relatives: parents/children): FamilyS = SibSp + Parch + 1,
# then bucket the count into categories
dataset['FamilyS'] = dataset['SibSp'] + dataset['Parch'] + 1
X_test['FamilyS'] = X_test['SibSp'] + X_test['Parch'] + 1
# Bucket family size: 1 -> 'Single', 2 -> 'Couple', 3-4 -> 'InterM', 5+ -> 'Large'
def family(x):
    if x < 2:
        return 'Single'
    elif x == 2:
        return 'Couple'
    elif x <= 4:
        return 'InterM'
    else:
        return 'Large'
dataset['FamilyS'] = dataset['FamilyS'].apply(family)
X_test['FamilyS'] = X_test['FamilyS'].apply(family)
print(dataset['FamilyS'].value_counts())
# Single 537
# Couple 161
# InterM 131
# Large 62
# Name: FamilyS, dtype: int64
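The same optional check applied to the new family-size buckets:
# Mean survival rate per family-size bucket
print(dataset.groupby('FamilyS')['Survived'].mean())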
- Filling missing data
# Replace missing values in the Embarked column with the column's mode
# (if there are several modes, take the first)
dataset['Embarked'].fillna(dataset['Embarked'].mode()[0], inplace=True)
X_test['Embarked'].fillna(X_test['Embarked'].mode()[0], inplace=True)
# Replace missing values in the Age column with the column's median
dataset['Age'].fillna(dataset['Age'].median(), inplace=True)
X_test['Age'].fillna(X_test['Age'].median(), inplace=True)
# Replace missing values in the Fare column with the column's median
dataset['Fare'].fillna(dataset['Fare'].median(), inplace=True)
X_test['Fare'].fillna(X_test['Fare'].median(), inplace=True)
- Dropping unused columns
# Drop the columns that will not be used as features
dataset = dataset.drop(['PassengerId', 'Cabin', 'Name', 'SibSp', 'Parch', 'Ticket'], axis=1)
X_test_passengers = X_test['PassengerId']  # keep the test-set IDs so predictions can be paired with passengers later
X_test = X_test.drop(['PassengerId', 'Cabin', 'Name', 'SibSp', 'Parch', 'Ticket'], axis=1)
print(dataset)
# Survived Pclass Sex Age Fare Embarked Title FamilyS
# 0 0 3 male 22.0 7.2500 S Mr Couple
# 1 1 1 female 38.0 71.2833 C Mrs Couple
# 2 1 3 female 26.0 7.9250 S Miss Single
# 3 1 1 female 35.0 53.1000 S Mrs Couple
# 4 0 3 male 35.0 8.0500 S Mr Single
# .. ... ... ... ... ... ... ... ...
# 886 0 2 male 27.0 13.0000 S Rare Single
# 887 1 1 female 19.0 30.0000 S Miss Single
# 888 0 3 female 28.0 23.4500 S Miss InterM
# 889 1 1 male 26.0 30.0000 C Mr Single
# 890 0 3 male 32.0 7.7500 Q Mr Single
3. Split the training, validation and test sets
- Sample and Label Separation
# Split the training data into samples and labels
X_train = dataset.iloc[:, 1:9].values
Y_train = dataset.iloc[:, 0].values
# The test set has samples only, no labels
X_test = X_test.values
- Convert categorical features to one-hot encoding
# One-hot encode the categorical feature columns
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
print(X_train[0], X_train[0].shape)
# [3 'male' 22.0 7.25 'S' 'Mr' 'Couple'] (7,)
one_hot_encoder = ColumnTransformer(
    [(
        'one_hot_encoder',                 # transformer name (arbitrary)
        OneHotEncoder(categories='auto'),  # the encoder to apply
        [0, 1, 4, 5, 6]                    # indices of the columns to encode
    )],
    remainder='passthrough'                # pass the remaining columns through unchanged
)
X_train = one_hot_encoder.fit_transform(X_train).tolist()
# Use transform (not fit_transform) on the test set, so it is encoded with the
# categories learned from the training set (here the cleaned test set contains
# the same category values, so no unseen categories occur)
X_test = one_hot_encoder.transform(X_test).tolist()
print(X_train[0], len(X_train[0]))
# [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 22.0, 7.25] 19
# the 7 feature columns expand into 19 (mostly zero) columns after encoding
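To see which feature each of the 19 positions corresponds to, scikit-learn (1.0+) can list the generated column names; a minimal optional sketch (since the input was a bare NumPy array, the names use positional placeholders such as x0):
# List the generated one-hot and passthrough feature names
print(one_hot_encoder.get_feature_names_out())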
- Split 1/10 of the data from the training set as a validation set
# Split off a validation set from the training data
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(X_train, Y_train, test_size=0.1)  # pass random_state=... for a reproducible split
print(len(x_train))
# 801
print(len(x_val))
# 90
print(len(y_train))
# 801
print(len(y_val))
# 90
4. Create a model
# Build the neural network
import torch
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(19, 270)  # expand the 19 input features to a 270-unit hidden layer
        self.fc2 = nn.Linear(270, 2)   # 2-class output
    def forward(self, x):
        x = self.fc1(x)
        x = F.dropout(x, p=0.1, training=self.training)  # apply dropout only in training mode
        x = F.elu(x)
        x = self.fc2(x)
        # x = torch.sigmoid(x)  # mapping outputs to (0, 1) is unnecessary: CrossEntropyLoss expects raw logits
        return x
net = Net()
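As an optional sanity check, printing the module and counting trainable parameters confirms the layer shapes:
# Inspect the architecture and count trainable parameters
print(net)
print(sum(p.numel() for p in net.parameters() if p.requires_grad))
# 19*270 + 270 (fc1) plus 270*2 + 2 (fc2) = 5942 parameters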
5. Specify training parameters
# Training hyperparameters
batch_size = 50
num_epochs = 50
learning_rate = 0.01
batch_no = len(x_train) // batch_size  # number of full batches per epoch (801 // 50 = 16; the leftover sample is dropped)
6. Define the loss function and optimizer
# Loss function and optimizer; CrossEntropyLoss applies log-softmax internally,
# which is why the network outputs raw logits
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)
7. Training
# Training loop
from sklearn.utils import shuffle
from torch.autograd import Variable  # legacy no-op wrapper in modern PyTorch; plain tensors behave the same
for epoch in range(num_epochs):
    if epoch % 5 == 0:
        print('Epoch {}'.format(epoch + 1))
    x_train, y_train = shuffle(x_train, y_train)  # reshuffle the samples each epoch
    # Mini-batch learning
    for i in range(batch_no):
        start = i * batch_size
        end = start + batch_size
        x_var = Variable(torch.FloatTensor(x_train[start:end]))
        y_var = Variable(torch.LongTensor(y_train[start:end]))
        # Forward + backward + optimize
        optimizer.zero_grad()
        ypred_var = net(x_var)
        loss = criterion(ypred_var, y_var)
        loss.backward()
        optimizer.step()
8. Validation accuracy
# Evaluate accuracy on the validation set
net.eval()  # switch to evaluation mode so dropout is disabled
test_var = torch.FloatTensor(x_val)  # no gradients are needed for evaluation
with torch.no_grad():
    result = net(test_var)
# values, labels = torch.max(result, 1)
# print(values, labels)
labels = torch.argmax(result, dim=1)
print(labels)
# tensor([1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0,
# 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
# 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1])
num_right = np.sum(labels.data.numpy() == y_val)
print('Accuracy {:.2f}'.format(num_right / len(y_val)))
# Accuracy 0.81
9. Prediction
# Predict on the test set
X_test_var = torch.FloatTensor(X_test)
with torch.no_grad():
    test_result = net(X_test_var)
# print(test_result)
values, labels = torch.max(test_result, 1)
survived = labels.data.numpy()
print(f"Predictions: {survived}")
# Predictions: [0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 1 1 0 0 1
# 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 1 0
# 1 1 0 1 0 1 1 0 0 0 0 0 1 1 1 1 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
# 1 1 1 1 0 0 1 0 1 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0
# 1 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1
# 0 1 0 0 0 0 0 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
# 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 0 1
# 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 0
# 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
# 1 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0
# 0 1 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0 0 0
# 1 1 1 1 1 1 0 1 0 0 1]
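Finally, the X_test_passengers series saved during cleaning pairs each prediction with its PassengerId. A minimal sketch of writing the results out in the Kaggle submission format (the file name is our own choice):
# Pair each PassengerId with its predicted Survived value and save to CSV
submission = pd.DataFrame({'PassengerId': X_test_passengers, 'Survived': survived})
submission.to_csv('submission.csv', index=False)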