PyTorch Titanic survivor prediction (binary classification)


I. Introduction

  1. Task objective: given the personal information of a passenger on the Titanic's passenger list, predict whether that passenger survived
  2. Dataset: "List of Boarding Personnel on the Titanic", from https://download.csdn.net/download/weixin_43721000/87740848
  3. Dataset column descriptions:
    Column 1, age: the passenger's age (numeric data)
    Column 2, cabin: the cabin number (categorical data: string type)
    Column 3, embarked: the port of embarkation; S is Southampton, C is Cherbourg (France), Q is Queenstown (Ireland) (categorical data: direct category)
    Column 4, fare: the ticket price (numeric data)
    Column 5, name: the passenger's name (categorical data: string type)
    Column 6, parch: the number of parents/children aboard, i.e. immediate family members of a different generation; for example, someone sailing with his daughter and his father has parch = number of parents (1) + number of children (1) = 2
    Column 7, passengerId: the passenger number
    Column 8, pclass: the cabin class; 1 is first class, 2 is second class, 3 is third class (categorical data: direct category)
    Column 9, sex: the gender, male or female (categorical data)
    Column 10, sibsp: the number of siblings/spouses aboard, i.e. immediate family members of the same generation; for example, someone sailing with his brother and his wife has sibsp = number of siblings (1) + number of spouses (1) = 2
    Column 11, survived: whether the passenger survived; 1 means survived, 0 means died (categorical data: direct category)
    Column 12, ticket: the ticket number (categorical data: string type)

II. Implementation method

1. Read the dataset

import numpy as np
import pandas as pd

dataset = pd.read_csv('/kaggle/input/titanic/train.csv')
X_test = pd.read_csv('/kaggle/input/titanic/test.csv')

print(dataset)
#      PassengerId  Survived  Pclass  \
# 0              1         0       3   
# 1              2         1       1   
# 2              3         1       3   
# 3              4         1       1   
# 4              5         0       3   
# ..           ...       ...     ...   
# 886          887         0       2   
# 887          888         1       1   
# 888          889         0       3   
# 889          890         1       1   
# 890          891         0       3   
# 
#                                                   Name     Sex   Age  SibSp  \
# 0                              Braund, Mr. Owen Harris    male  22.0      1   
# 1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
# 2                               Heikkinen, Miss. Laina  female  26.0      0   
# 3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
# 4                             Allen, Mr. William Henry    male  35.0      0   
# ..                                                 ...     ...   ...    ...   
# 886                              Montvila, Rev. Juozas    male  27.0      0   
# 887                       Graham, Miss. Margaret Edith  female  19.0      0   
# 888           Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1   
# 889                              Behr, Mr. Karl Howell    male  26.0      0   
# 890                                Dooley, Mr. Patrick    male  32.0      0   
# 
#      Parch            Ticket     Fare Cabin Embarked  
# 0        0         A/5 21171   7.2500   NaN        S  
# 1        0          PC 17599  71.2833   C85        C  
# 2        0  STON/O2. 3101282   7.9250   NaN        S  
# 3        0            113803  53.1000  C123        S  
# 4        0            373450   8.0500   NaN        S  
# ..     ...               ...      ...   ...      ...  
# 886      0            211536  13.0000   NaN        S  
# 887      0            112053  30.0000   B42        S  
# 888      2        W./C. 6607  23.4500   NaN        S  
# 889      0            111369  30.0000  C148        C  
# 890      0            370376   7.7500   NaN        Q
#  
# [891 rows x 12 columns]

print(X_test)
#      PassengerId  Pclass                                          Name  \
# 0            892       3                              Kelly, Mr. James   
# 1            893       3              Wilkes, Mrs. James (Ellen Needs)   
# 2            894       2                     Myles, Mr. Thomas Francis   
# 3            895       3                              Wirz, Mr. Albert   
# 4            896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)   
# ..           ...     ...                                           ...   
# 413         1305       3                            Spector, Mr. Woolf   
# 414         1306       1                  Oliva y Ocana, Dona. Fermina   
# 415         1307       3                  Saether, Mr. Simon Sivertsen   
# 416         1308       3                           Ware, Mr. Frederick   
# 417         1309       3                      Peter, Master. Michael J   
# 
#         Sex   Age  SibSp  Parch              Ticket      Fare Cabin Embarked  
# 0      male  34.5      0      0              330911    7.8292   NaN        Q  
# 1    female  47.0      1      0              363272    7.0000   NaN        S  
# 2      male  62.0      0      0              240276    9.6875   NaN        Q  
# 3      male  27.0      0      0              315154    8.6625   NaN        S  
# 4    female  22.0      1      1             3101298   12.2875   NaN        S  
# ..      ...   ...    ...    ...                 ...       ...   ...      ...  
# 413    male   NaN      0      0           A.5. 3236    8.0500   NaN        S  
# 414  female  39.0      0      0            PC 17758  108.9000  C105        C  
# 415    male  38.5      0      0  SOTON/O.Q. 3101262    7.2500   NaN        S  
# 416    male   NaN      0      0              359309    8.0500   NaN        S  
# 417    male   NaN      1      1                2668   22.3583   NaN        C  
# 
# [418 rows x 11 columns]
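
Before cleaning, it is worth checking which columns actually contain missing values; the cleaning steps below (filling Embarked, Age and Fare, dropping Cabin) follow directly from this check:

# Count missing values per column: in train.csv the incomplete columns are
# Age, Cabin and Embarked; in test.csv they are Age, Fare and Cabin
print(dataset.isnull().sum())
print(X_test.isnull().sum())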

2. Data cleaning

  1. Title cleaning
# Extract the title (Mr, Mrs, ...) from the Name field; infrequent titles are later merged into 'Rare' to reduce noise
dataset_title = [i.split(',')[1].split('.')[0].strip() for i in dataset['Name']]    # extract the title
dataset['Title'] = pd.Series(dataset_title)                                         # insert as a new column
print(dataset['Title'].value_counts())
# Mr              517
# Miss            182
# Mrs             125
# Master           40
# Dr                7
# Rev               6
# Mlle              2
# Major             2
# Col               2
# the Countess      1
# Capt              1
# Ms                1
# Sir               1
# Lady              1
# Mme               1
# Don               1
# Jonkheer          1
# Name: Title, dtype: int64

# Merge the infrequent titles into a single 'Rare' category
dataset['Title'] = dataset['Title'].replace(['Lady', 'the Countess', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona', 'Ms', 'Mme', 'Mlle'], 'Rare')
print(dataset['Title'].value_counts()) 
# Mr        517
# Miss      182
# Mrs       125
# Master     40
# Rare       27
# Name: Title, dtype: int64

dataset_title = [i.split(',')[1].split('.')[0].strip() for i in X_test['Name']]
X_test['Title'] = pd.Series(dataset_title)
print(X_test['Title'].value_counts())    
# Mr        240
# Miss       78
# Mrs        72
# Master     21
# Col         2
# Rev         2
# Ms          1
# Dr          1
# Dona        1
# Name: Title, dtype: int64

X_test['Title'] = X_test['Title'].replace(['Lady', 'the Countess', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona', 'Ms', 'Mme', 'Mlle'], 'Rare')
print(X_test['Title'].value_counts()) 
# Mr        240
# Miss       78
# Mrs        72
# Master     21
# Rare        7
# Name: Title, dtype: int64
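
The split chain used above relies on every name following the pattern "Surname, Title. Given names"; tracing it on the first training name makes each step concrete:

name = 'Braund, Mr. Owen Harris'
after_comma = name.split(',')[1]              # ' Mr. Owen Harris' (text after the surname)
before_period = after_comma.split('.')[0]     # ' Mr' (text before the first period)
print(before_period.strip())                  # 'Mr' (surrounding whitespace removed)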

  2. Family size cleaning
# From SibSp (same-generation relatives: siblings/spouse) and Parch (cross-generation relatives: parents/children), derive the family size aboard = SibSp + Parch + 1, then bucket it into categories
dataset['FamilyS'] = dataset['SibSp'] + dataset['Parch'] + 1
X_test['FamilyS'] = X_test['SibSp'] + X_test['Parch'] + 1
def family(x):
    if x < 2:
        return 'Single'
    elif x == 2:
        return 'Couple'
    elif x <= 4:
        return 'InterM'
    else:
        return 'Large'
    
dataset['FamilyS'] = dataset['FamilyS'].apply(family)
X_test['FamilyS'] = X_test['FamilyS'].apply(family)
print(dataset['FamilyS'].value_counts()) 
# Single    537
# Couple    161
# InterM    131
# Large      62
# Name: FamilyS, dtype: int64
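
The bucket boundaries of family() can be read off directly: 1 person maps to 'Single', 2 to 'Couple', 3 or 4 to 'InterM', and 5 or more to 'Large':

for size in [1, 2, 3, 4, 5, 8]:
    print(size, family(size))
# 1 Single
# 2 Couple
# 3 InterM
# 4 InterM
# 5 Large
# 8 Large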
  3. Filling missing data
# Replace missing values in the Embarked column with the column's mode (if there are several modes, take the first)
dataset['Embarked'].fillna(dataset['Embarked'].mode()[0], inplace=True)
X_test['Embarked'].fillna(X_test['Embarked'].mode()[0], inplace=True)
# Replace missing values in the Age column with the column's median
dataset['Age'].fillna(dataset['Age'].median(), inplace=True)
X_test['Age'].fillna(X_test['Age'].median(), inplace=True)
# Replace missing values in the Fare column with the column's median
dataset['Fare'].fillna(dataset['Fare'].median(), inplace=True)
X_test['Fare'].fillna(X_test['Fare'].median(), inplace=True)
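
Note that .mode() returns a Series rather than a scalar, since a column can have several equally frequent values, hence the [0] to take the first one:

s = pd.Series(['S', 'S', 'C', 'Q'])
print(s.mode())      # a Series: 0    S
print(s.mode()[0])   # the scalar 'S'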
  4. Dropping unused columns
# Drop the columns that will not be used, keeping the test set's PassengerId aside for later
dataset = dataset.drop(['PassengerId', 'Cabin', 'Name', 'SibSp', 'Parch', 'Ticket'], axis=1)
X_test_passengers = X_test['PassengerId']    # saved so predictions can be matched back to passengers
X_test = X_test.drop(['PassengerId', 'Cabin', 'Name', 'SibSp', 'Parch', 'Ticket'], axis=1)
print(dataset)
#      Survived  Pclass     Sex   Age     Fare Embarked Title FamilyS
# 0           0       3    male  22.0   7.2500        S    Mr  Couple
# 1           1       1  female  38.0  71.2833        C   Mrs  Couple
# 2           1       3  female  26.0   7.9250        S  Miss  Single
# 3           1       1  female  35.0  53.1000        S   Mrs  Couple
# 4           0       3    male  35.0   8.0500        S    Mr  Single
# ..        ...     ...     ...   ...      ...      ...   ...     ...
# 886         0       2    male  27.0  13.0000        S  Rare  Single
# 887         1       1  female  19.0  30.0000        S  Miss  Single
# 888         0       3  female  28.0  23.4500        S  Miss  InterM
# 889         1       1    male  26.0  30.0000        C    Mr  Single
# 890         0       3    male  32.0   7.7500        Q    Mr  Single

3. Divide the training set, validation set and test set

  1. Separating samples and labels
# Split the training set into samples and labels
X_train = dataset.iloc[:, 1:9].values
Y_train = dataset.iloc[:, 0].values
# the test set has samples only, no labels
X_test = X_test.values
  2. Converting categorical features to one-hot encoding
# Categorical (text) columns are converted directly to one-hot encoding
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

print(X_train[0], X_train[0].shape)
# [3 'male' 22.0 7.25 'S' 'Mr' 'Couple'] (7,)

one_hot_encoder = ColumnTransformer(
    [(
        'one_hot_encoder',                   # transformer name (arbitrary)
        OneHotEncoder(categories='auto'),    # the encoder to apply
        [0, 1, 4, 5, 6]                      # indices of the columns to encode (Pclass, Sex, Embarked, Title, FamilyS)
    )],
    remainder='passthrough'                  # keep the unencoded columns (Age, Fare) as they are
)
X_train = one_hot_encoder.fit_transform(X_train).tolist()
X_test = one_hot_encoder.transform(X_test).tolist()    # reuse the encoder fitted on the training set so the columns line up

print(X_train[0], len(X_train[0]))
# [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 22.0, 7.25] 19
# the 7 feature columns expand to 19: Pclass (3) + Sex (2) + Embarked (3) + Title (5) + FamilyS (4) one-hot columns, plus the passed-through Age and Fare
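
To see exactly which category each of the 19 columns represents, the fitted transformer can list its output column names (a sketch; get_feature_names_out requires scikit-learn >= 1.0):

# 17 one-hot columns for Pclass/Sex/Embarked/Title/FamilyS plus the
# passed-through Age and Fare; names look like 'one_hot_encoder__x0_1'
print(one_hot_encoder.get_feature_names_out())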
  3. Splitting 1/10 of the training data off as a validation set
# split into training and validation sets
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(X_train, Y_train, test_size = 0.1)
print(len(x_train))
# 801
print(len(x_val))
# 90
print(len(y_train))
# 801
print(len(y_val))
# 90
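
As written, train_test_split draws a different random 801/90 split on every run, so the reported accuracy varies slightly; passing a fixed seed makes it reproducible (an optional tweak, not in the original code; the seed value is arbitrary):

x_train, x_val, y_train, y_val = train_test_split(
    X_train, Y_train, test_size=0.1, random_state=42)    # fixed seed for a reproducible split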

4. Create a model

# Build the neural network
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(19, 270)     # expand the 19 input features to 270 hidden units
        self.fc2 = nn.Linear(270, 2)      # 2 output classes (died / survived)
        
    def forward(self, x):
        x = self.fc1(x)
        x = F.dropout(x, p=0.1, training=self.training)   # only apply dropout in training mode (F.dropout defaults to training=True even during eval)
        x = F.elu(x)
        x = self.fc2(x)
#         x = torch.sigmoid(x)              # would squash outputs into (0, 1); unnecessary here, since CrossEntropyLoss expects raw logits
        
        return x
    
net = Net()
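
A quick shape check confirms the wiring: a batch of 19-feature rows should come out as one pair of class logits (died / survived) per row:

dummy = torch.randn(4, 19)    # a fake batch of 4 samples with 19 features
print(net(dummy).shape)
# torch.Size([4, 2])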

5. Specify training parameters

# training parameters
batch_size = 50
num_epochs = 50
learning_rate = 0.01
batch_no = len(x_train) // batch_size    # full batches per epoch: 801 // 50 = 16 (the last partial batch is skipped)

6. Define the loss function and optimizer

# loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)
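
CrossEntropyLoss combines LogSoftmax and NLLLoss in a single step, which is exactly why forward() returns raw logits and the sigmoid line stays commented out. The equivalence is easy to verify:

logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])
targets = torch.tensor([0, 1])
print(criterion(logits, targets))                           # CrossEntropyLoss on raw logits
print(F.nll_loss(F.log_softmax(logits, dim=1), targets))    # identical value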

7. Training

# training loop
from sklearn.utils import shuffle

for epoch in range(num_epochs):
    if epoch % 5 == 0:
        print('Epoch {}'.format(epoch+1))
    x_train, y_train = shuffle(x_train, y_train)     # reshuffle every epoch
    # Mini batch learning
    for i in range(batch_no):
        start = i * batch_size
        end = start + batch_size
        x_var = torch.FloatTensor(x_train[start:end])
        y_var = torch.LongTensor(y_train[start:end])
        # Forward + Backward + Optimize
        optimizer.zero_grad()
        ypred_var = net(x_var)
        loss = criterion(ypred_var, y_var)
        loss.backward()
        optimizer.step()
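
The manual slicing above silently drops the last partial batch (801 is not a multiple of 50) and reshuffles by hand. An equivalent, more idiomatic alternative is torch.utils.data.DataLoader; this sketch replaces the loop above rather than adding to it:

from torch.utils.data import DataLoader, TensorDataset

train_ds = TensorDataset(torch.FloatTensor(x_train), torch.LongTensor(y_train))
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)    # reshuffles every epoch

for epoch in range(num_epochs):
    for x_var, y_var in train_loader:    # also yields the final partial batch
        optimizer.zero_grad()
        loss = criterion(net(x_var), y_var)
        loss.backward()
        optimizer.step()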

8. Validation accuracy

# validation accuracy
net.eval()                            # switch to evaluation mode so dropout is disabled
test_var = torch.FloatTensor(x_val)
with torch.no_grad():
    result = net(test_var)
# values, labels = torch.max(result, 1)
# print(values, labels)
labels = torch.argmax(result, dim=1)
print(labels)
# tensor([1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0,
#        0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
#        0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
#        0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1])
num_right = np.sum(labels.numpy() == y_val)
print('Accuracy {:.2f}'.format(num_right / len(y_val)))
# Accuracy 0.81

9. Prediction

# prediction on the test set
X_test_var = torch.FloatTensor(X_test)
with torch.no_grad():
    test_result = net(X_test_var)
#     print(test_result)
values, labels = torch.max(test_result, 1)
survived = labels.numpy()
print(f"Prediction results: {survived}")
# Prediction results: [0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 1 1 0 0 1
#  1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 1 0
#  1 1 0 1 0 1 1 0 0 0 0 0 1 1 1 1 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
#  1 1 1 1 0 0 1 0 1 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0
#  1 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1
#  0 1 0 0 0 0 0 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
#  1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 0 1
#  0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 0
#  1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
#  1 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0
#  0 1 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0 0 0
#  1 1 1 1 1 1 0 1 0 0 1]
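
The PassengerId column saved earlier as X_test_passengers exists precisely to pair each prediction with its passenger: Kaggle's Titanic competition expects a two-column CSV (PassengerId, Survived). A minimal sketch, with the output filename arbitrary:

submission = pd.DataFrame({
    'PassengerId': X_test_passengers,
    'Survived': survived,
})
submission.to_csv('submission.csv', index=False)    # ready to upload to Kaggle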

Original post: https://blog.csdn.net/weixin_43721000/article/details/130429132