Logistic Regression Predicts Titanic Passenger Survival

Description

The sinking of the RMS Titanic is one of the most notorious shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 of her 2,224 passengers and crew. This sensational tragedy shocked the international community and led to better ship safety regulations.

One reason the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although some element of luck was involved in surviving the sinking, some groups were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we analyze which kinds of people were likely to survive. In particular, we apply machine learning tools to predict which passengers survived the tragedy.

This task uses the logistic regression algorithm to predict the survival of Titanic passengers. The steps include data acquisition, data analysis, missing-value handling, feature engineering, dataset splitting, model building, prediction, and model evaluation.

Environment

  • Operating system: Windows 10

  • Tool software: Anaconda3 2019, Python3.6, Jupyter Notebook

  • Hardware environment: no special requirements

  • Dependency library list

    scikit-learn	0.24.2
    pandas, numpy (no specific version required; both are bundled with Anaconda3)


Analysis

"Logistic Regression Predicts Titanic Passenger Survival Rate" involves the following links:

(Figure: workflow of the analysis, from data acquisition through model evaluation.)

Implementation

1. Clarify the goal

The purpose of data analysis depends on the needs of the project; analysis divorced from real project needs is like water without a source. So the first step is to clarify the goal of the analysis.

This project concerns the famous Titanic shipwreck. We want to know which factors affected survival in the disaster, and to predict whether individual passengers survived.

2. Data Analysis

2.1 Data Acquisition

The Titanic passenger dataset can be downloaded from the project website: https://www.kaggle.com/c/titanic/data

The training set is train.csv; the test set is test.csv.

The dataset is already provided with this task, so there is no need to download it again.

2.2 Data import

Open Jupyter Notebook, create the file "Logistic Regression Prediction of Titanic Passenger Survival Rate.ipynb", place the dataset files where the code expects them (here, a ../dataset/ directory relative to the notebook), and run the following code to import the data:

import pandas as pd  # import the pandas library
import numpy as np   # import the numpy library

df_titanic = pd.read_csv('../dataset/train.csv')  # read the file
print("Training set shape:", df_titanic.shape)    # rows and columns of the training set
df_titanic.head()                                 # show the first 5 rows

Experimental results:

(Figure: the first five rows of the training set.)

The meaning of each data field:

  • PassengerId: unique passenger id
  • Survived: survival status; 0 = no, 1 = yes
  • Pclass: ticket class (1, 2, or 3)
  • Name: name
  • Sex: sex
  • Age: age
  • SibSp: number of siblings/spouses aboard with the passenger
  • Parch: number of parents/children aboard with the passenger
  • Ticket: ticket number
  • Fare: ticket fare
  • Cabin: cabin number
  • Embarked: port of embarkation; S = Southampton, England (origin), C = Cherbourg, France (stop), Q = Queenstown, Ireland (stop)

2.3 View data information

# Check for missing values
df_titanic.isnull().sum()

Output result:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

You can see that Age has 177 missing values, Cabin has 687 missing values, and Embarked has 2 missing values.

3. Feature Engineering

Feature engineering is an indispensable part of machine learning and occupies a very important place in the field. It refers to using a series of engineering methods to distill better data features from raw data in order to improve the model's training results. A saying widely circulated in the industry goes: data and features determine the upper limit of machine learning, while models and algorithms merely approach that limit. Good data and features are thus the prerequisite for models and algorithms to do their best work. Feature engineering usually includes data preprocessing, feature selection, and dimensionality reduction, as shown below:

(Figure: the stages of feature engineering: data preprocessing, feature selection, dimensionality reduction.)
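As a concrete illustration of these three stages (purely a sketch, not part of this project's workflow; the components and the values k=8 and n_components=5 are our own choices), scikit-learn can chain them in a single Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

# Illustrative only: the three stages of feature engineering chained together
fe_pipeline = Pipeline([
    ('preprocess', StandardScaler()),         # data preprocessing: standardize features
    ('select', SelectKBest(f_classif, k=8)),  # feature selection: keep the 8 highest-scoring features
    ('reduce', PCA(n_components=5)),          # dimensionality reduction: project onto 5 components
])
# fe_pipeline.fit_transform(features, labels) would apply all three stages in order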

3.1 Data cleaning

Age has many missing values, which we fill with the mean. Embarked is categorical data, which is usually filled with the mode. We do not process Cabin here: it is a feature with little impact on model training, so we ignore it and remove it, along with the other irrelevant features, in a later step. The code to fill Age with the mean is as follows:

df_titanic['Age'] = df_titanic['Age'].fillna(df_titanic['Age'].mean())  # fill the Age feature with the mean
df_titanic.isnull().sum()  # re-check missing values

Experimental results:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

You can see that the Age feature now has no missing values.

Find the mode of the Embarked feature:

df_titanic['Embarked'].value_counts()  # count occurrences of each value to find the mode

Experimental results:

S    644
C    168
Q     77
Name: Embarked, dtype: int64

The output shows that S is the mode, so we fill the missing values with S:

df_titanic['Embarked'] = df_titanic['Embarked'].fillna('S')  # fill the missing values with the mode
df_titanic.isnull().sum()  # re-check missing values

Experimental results:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64

The missing values for the Age and Embarked features have now been filled. As an exercise, you can try filling the Cabin feature the same way Embarked was filled; a sketch follows.
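A minimal sketch of that exercise (assuming you want to keep Cabin rather than drop it; with 687 of 891 values missing, mode-filling Cabin is of questionable value, so this shows only the mechanics):

cabin_mode = df_titanic['Cabin'].mode()[0]  # Series.mode() returns the most frequent value(s); [0] takes the first
df_titanic['Cabin'] = df_titanic['Cabin'].fillna(cabin_mode)  # fill missing cabins with the mode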

3.2 Feature Extraction

Convert the Sex and Embarked features to dummy variables, then display the data:

# Convert categorical variables to dummy variables
a = pd.get_dummies(df_titanic['Sex'], prefix="Sex")
b = pd.get_dummies(df_titanic['Embarked'], prefix="Em")
# Add the dummy variables to the dataframe
frames = [df_titanic, a, b]
df_titanic = pd.concat(frames, axis=1)
df_titanic = df_titanic.drop(columns=['Sex', 'Embarked'])
df_titanic.head()  # show the new dataframe

Experimental results:

(Figure: the dataframe with the new Sex_* and Em_* dummy columns.)

3.3 Feature Selection

Remove irrelevant features from the dataset:

X = df_titanic.drop(['Survived', 'Name', 'Ticket', 'Cabin'], axis=1)  # drop the less relevant fields to build the feature set
y = df_titanic.Survived.values  # build the label set
y = y.reshape(-1, 1)            # reshape into a column vector; -1 lets NumPy infer the length
X.head()                        # show the first five rows of the features

Experimental results:
(Figure: the first five rows of the feature set X.)

3.4 Split Dataset

Split the training data into a training set and a test set at a ratio of 8:2: 80% of the data is used to train the model and 20% to evaluate it. (Note that train_test_split shuffles the rows randomly, so without a fixed random_state each run produces a slightly different split, and thus a slightly different accuracy.)

from sklearn.model_selection import train_test_split  # import the dataset-splitting helper
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # split the dataset 8:2
print("X_train", X_train.shape)  # rows and columns of the training set
print("X_test", X_test.shape)    # rows and columns of the test set

Experimental results:

X_train (712, 11)
X_test (179, 11)

Interface description:

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    train_data, train_target, test_size=0.4, random_state=0, stratify=y_train)

  • train_data: the sample feature set to be split

  • train_target: the sample labels to be split

  • test_size: the test-set proportion; if given as an integer, it is the number of test samples. The default is 0.25.

  • random_state: the seed of the random number generator. Fix it (for example, always pass 1) and repeated runs with the same other parameters produce the same split; leave it unset (None) and each run produces a different split. See the sketch after this list.

  • stratify: if set to the label array, the split preserves the class proportions in both halves.

  • Return values: X_train is the training feature set, X_test the test feature set, y_train the training labels, and y_test the test labels.
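To make the effect of random_state concrete, here is a small check (a sketch; the variable names Xa and Xb are ours): with the same seed, two calls select exactly the same rows.

# Same seed, same split: the chosen row indices match exactly
Xa, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=1)
Xb, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=1)
print((Xa.index == Xb.index).all())  # True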

3.5 Data Standardization

Data standardization transforms the original data so that it has a mean of 0 and a variance of 1. Its mathematical expression is as follows:

x′ = (x − mean) / σ

where mean is the feature's mean and σ is its standard deviation. Because standardization is based on the mean rather than on the extreme values, it is less affected by outliers than min-max scaling. In this experiment we use StandardScaler from sklearn.preprocessing to standardize the data:

from sklearn.preprocessing import StandardScaler  # import the standardization module

scaler = StandardScaler()                # choose the standardizing scaler
X_train = scaler.fit_transform(X_train)  # fit on the training set and transform it
X_test = scaler.transform(X_test)        # transform the test set with the training-set statistics (no leakage)

4. Model construction

The interface for building a logistic regression model is LogisticRegression(), found in the sklearn.linear_model module. After creating the model, you can train it directly with the fit() interface.
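For orientation, here is an illustrative sketch of the constructor's commonly used hyperparameters (the values shown are scikit-learn's defaults; this demo object is not used later):

from sklearn.linear_model import LogisticRegression

# Illustrative only: common hyperparameters and what they control
lr_demo = LogisticRegression(
    penalty='l2',    # regularization type
    C=1.0,           # inverse of regularization strength (smaller = stronger)
    solver='lbfgs',  # optimization algorithm used by fit()
    max_iter=100,    # cap on optimization iterations
)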

4.1 Build the model

Create a logistic regression model using the LogisticRegression() interface:

from sklearn.linear_model import LogisticRegression  # import the logistic regression model

lr = LogisticRegression()  # lr is our logistic regression model

4.2 Model training

Use the fit() interface to train the model; its parameters are the training set features X_train and the training set labels y_train:

lr.fit(X_train, y_train.ravel())  # fit runs the training optimization (think gradient descent); ravel() flattens y to the 1-D shape sklearn expects

5. Model evaluation

The model is evaluated on the test split held out from the training data: the trained model predicts labels for the test features X_test, and the predictions are compared with the true test labels y_test to obtain the model's accuracy.

5.1 Model Evaluation

The model's accuracy is computed with the lr model's score() interface, whose parameters are the test set features X_test and the true test labels y_test.

score = lr.score(X_test, y_test)  # compute accuracy on the test split
print("scikit-learn logistic regression test accuracy: {:.2f}%".format(score * 100))  # print the result

Experimental results:

scikit-learn logistic regression test accuracy: 77.09%

The model reaches 77.09% accuracy on the held-out test split (your exact number may differ, since the split is random).
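Accuracy alone can hide class-specific behavior. As an optional extra check (a sketch, not part of the original walkthrough), scikit-learn's confusion matrix and classification report break the errors down by class:

from sklearn.metrics import classification_report, confusion_matrix

y_pred = lr.predict(X_test)                           # predicted labels for the test split
print(confusion_matrix(y_test.ravel(), y_pred))       # rows = true class, columns = predicted class
print(classification_report(y_test.ravel(), y_pred))  # precision, recall, F1 for each class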

5.2 Data Prediction

Import the data in test.csv, fill its missing values following the same preprocessing used for the training data, standardize it with the scaler fitted on the training set, and then make predictions with the trained model.

Read the prediction data

# Read the prediction data
df_titanic_test = pd.read_csv('../dataset/test.csv')  # read the file
print("Prediction set shape:", df_titanic_test.shape)
df_titanic_test.head()  # show the first 5 rows

Experimental results:

(Figure: the first five rows of test.csv.)

The prediction data has 11 features and no label (survived or not).

Check for missing values

# Check for missing values
df_titanic_test.isnull().sum()

Experimental results:

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

Among them, Age, Cabin, and Fare have missing values. As before, we fill only Age and Fare (Cabin is dropped later); both are filled with the mean:

# Fill Age and Fare with the mean
df_titanic_test['Age'] = df_titanic_test['Age'].fillna(df_titanic_test['Age'].mean())
df_titanic_test['Fare'] = df_titanic_test['Fare'].fillna(df_titanic_test['Fare'].mean())
df_titanic_test.isnull().sum()

Experimental results:

PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          327
Embarked         0
dtype: int64

Filling is complete.

Convert the Sex and Embarked features to dummy variables

# Convert categorical variables to dummy variables
a = pd.get_dummies(df_titanic_test['Sex'], prefix="Sex")
b = pd.get_dummies(df_titanic_test['Embarked'], prefix="Em")
# Add the dummy variables to the dataframe
frames = [df_titanic_test, a, b]
df_titanic_test = pd.concat(frames, axis=1)
df_titanic_test = df_titanic_test.drop(columns=['Sex', 'Embarked'])
df_titanic_test.head()  # show the new dataframe

Experimental results:

(Figure: the prediction dataframe with the new dummy columns.)

Eliminate irrelevant features

df_titanic_test = df_titanic_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)  # drop the less relevant fields to build the feature set
df_titanic_test.head()  # show the first five rows of the features

Experimental results:

(Figure: the first five rows of the final prediction feature set.)

Data standardization

df_titanic_test = scaler.transform(df_titanic_test)  # standardize with the scaler fitted on the training set

Data prediction

pred_titanic_test = lr.predict(df_titanic_test)  # predict with the trained lr model
print(pred_titanic_test)

Experimental results:

[0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1
 1 0 0 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1 1
 1 1 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0
 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1
 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
 1 0 1 1 0 1 0 0 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0 0 0 0 1
 0 0 0 1 1 0 0 1 0 1 0 1 0 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 1 0 0 1 1 0 0 0 0
 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 1 0
 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0
 0 1 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 0 0
 0 1 1 1 1 1 0 1 0 0 0]
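As an optional follow-up (a sketch; the file name submission.csv and the variable names are ours), these 0/1 predictions can be saved in the two-column PassengerId/Survived format the Kaggle competition expects. Because df_titanic_test was overwritten by scaler.transform above and is now a NumPy array, we re-read the ids from test.csv:

# Hypothetical helper, not part of the original article: save predictions for Kaggle
ids = pd.read_csv('../dataset/test.csv')['PassengerId']  # recover the passenger ids
submission = pd.DataFrame({'PassengerId': ids, 'Survived': pred_titanic_test})
submission.to_csv('submission.csv', index=False)  # one row per passenger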

Calculate the survival rate

survival_rate = pred_titanic_test.tolist().count(1) / len(pred_titanic_test.tolist())  # tolist() converts the ndarray to a list
print("Predicted survival rate of Titanic passengers: {:.2f}%".format(survival_rate * 100))  # print the result

Experimental results:

Predicted survival rate of Titanic passengers: 36.12%
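Note that this 36.12% is the model's predicted survival rate among the test.csv passengers only; the historical rate implied by the figures in the introduction is about 32% (722 survivors out of 2,224). As a side note, since the predictions are 0s and 1s, the same rate can be computed more concisely:

# Equivalent one-liner: the mean of a 0/1 array is the fraction of 1s
survival_rate = pred_titanic_test.mean()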

Origin: blog.csdn.net/qq_40186237/article/details/130148367