ChatGPT achieves Titanic survival prediction

1. Introduction

1.1 Research Background

The sinking of the Titanic in 1912 is one of the most famous shipwrecks in history, causing a large number of casualties among its passengers. The dataset used here comes from the Kaggle competition "Titanic: Machine Learning from Disaster" and provides various information about the passengers. The goal of this study is to build a model that predicts passenger survival by analyzing the relationship between passenger characteristics and survival.

1.2 Dataset Introduction

The dataset consists of two parts: a training set (train.csv) and a test set (test.csv). The training set contains passenger features such as cabin class, sex, and age, together with each passenger's survival status (survived or not). The test set has the same structure except that the survival attribute is absent.

1.3 Research purpose

The main purpose of this study is to build a model that accurately predicts passenger survival from the provided dataset. In addition, we explore the relationship between different characteristics and passenger survival and make recommendations based on the analysis results.

2. Data description

2.1 Data sources

The dataset comes from the Kaggle competition "Titanic: Machine Learning from Disaster", which aims to predict passenger survival through machine learning methods.

2.2 Dataset structure and variable definition

The provided training dataset contains the following variables:
- PassengerId: unique identifier for each passenger
- Survived: survival status (0 = No, 1 = Yes); present only in the training set
- Pclass: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Name: passenger name
- Sex: sex of the passenger
- Age: age in years
- SibSp: number of siblings/spouses aboard
- Parch: number of parents/children aboard
- Ticket: ticket number
- Fare: passenger fare
- Cabin: cabin number
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

2.3 Data summary and statistical analysis

This part loads the training and test set data and then performs data summarization and statistical analysis. The basic statistical characteristics of a dataset describe the distribution, central tendency, and dispersion of each feature. The correlation coefficient matrix measures the linear correlation between features, which helps in understanding their relationships.

The corresponding code is as follows:

import pandas as pd

# Load the training set
train_data = pd.read_csv('train.csv')

# Load the test set
test_data = pd.read_csv('test.csv')

# Basic statistical characteristics of the dataset
summary = train_data.describe()
print("Basic statistical characteristics of the dataset:")
print(summary)

# Correlation coefficient matrix between features
# (numeric_only=True is required in pandas >= 2.0, since the
#  dataset still contains non-numeric columns at this point)
corr_matrix = train_data.corr(numeric_only=True)
print("\nCorrelation coefficient matrix between features:")
print(corr_matrix)

The code first imports the pandas library, which is used for data processing and analysis. The pd.read_csv() function reads the training and test sets from CSV files into the train_data and test_data variables. The describe() function computes the basic statistical characteristics of the training set, which are stored in summary and printed to give an overview of the data. The corr() function computes the correlation coefficient matrix between the numeric features of the training set (stored in corr_matrix), which is then printed to show how the features relate to one another.
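Beyond describe() and corr(), a quick structural overview helps catch parsing problems and missing data early. A minimal sketch using the frames loaded above:

# Column dtypes and non-null counts (reveals missing values at a glance)
train_data.info()

# First few rows as a sanity check on the parsing
print(train_data.head())

# Number of rows and columns in each split
print("Train shape:", train_data.shape, "Test shape:", test_data.shape)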

3. Data processing

3.1 Missing value processing

This section checks the dataset for missing values and handles them. For both the training and test sets, it first counts the number of missing values for each variable and then fills them in.

Handling missing values is an important step in data preprocessing that ensures data integrity and accuracy. In the code below, missing values in the age column are filled with the mean, while missing values in the embarkation port and fare columns are filled with the mode.

Below is example code for missing value handling using the pandas library:

# Check for missing values
missing_values1 = train_data.isnull().sum()
print("Number of missing values per variable (training set):")
print(missing_values1)
missing_values2 = test_data.isnull().sum()
print("Number of missing values per variable (test set):")
print(missing_values2)

# Handle missing values in the training set: fill Age with the mean
# and Embarked with the mode (chained inplace fillna is deprecated in
# recent pandas, so assign the result back instead)
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].mean())
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])

# Handle missing values in the test set: fill Age with the mean
# and Fare with the mode
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].mean())
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mode()[0])

The isnull().sum() call counts the number of missing values for each variable in the training set and stores the result in missing_values1; printing it shows where data is missing in the training set. The same is done for the test set, with the result stored in missing_values2.

fillna() then fills the missing values in the 'Age' column of the training set with the column mean (computed by mean()), and the missing values in the 'Embarked' column with the mode (the first element of the Series returned by mode()).

The test set is treated the same way: missing 'Age' values are filled with the mean and missing 'Fare' values with the mode.

3.2 Feature Engineering

This section creates a new feature, 'FamilySize', which represents the passenger's family size: the sum of 'SibSp' and 'Parch' plus 1 (the passenger themselves). For the categorical feature 'Sex', a LabelEncoder converts the categories into a numerical representation for subsequent modeling and analysis.

These steps are part of feature engineering, which uses existing features to create new, meaningful ones and encodes categorical features so that machine learning algorithms can process them.

Here is example code for feature engineering using pandas and scikit-learn:

from sklearn.preprocessing import LabelEncoder

# Create a new feature: family size
train_data['FamilySize'] = train_data['SibSp'] + train_data['Parch'] + 1
test_data['FamilySize'] = test_data['SibSp'] + test_data['Parch'] + 1

# Feature encoding: sex
label_encoder = LabelEncoder()
train_data['Sex'] = label_encoder.fit_transform(train_data['Sex'])
test_data['Sex'] = label_encoder.transform(test_data['Sex'])

In the training set, each passenger's family size is computed by adding 'SibSp' (number of siblings/spouses aboard) and 'Parch' (number of parents/children aboard) plus 1, and the result is stored in a new column called 'FamilySize'. The test set is handled in the same way.

For the gender encoding, a LabelEncoder object is first instantiated. The 'Sex' column of the training set is encoded by calling fit_transform(), which converts the gender categories into numerical codes and overwrites the original column. The 'Sex' column of the test set is then encoded with the same fitted encoder by calling transform(), so the test set uses exactly the mapping learned on the training set.
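To double-check the mapping the encoder learned, its classes_ attribute can be inspected; the integer code of each category is its index in that array. A minimal sketch using the label_encoder fitted above:

# classes_ holds the original categories in sorted order;
# e.g. ['female', 'male'] means female -> 0, male -> 1
print(label_encoder.classes_)
print(label_encoder.transform(['female', 'male']))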

3.3 Data standardization or normalization

This part standardizes the data, rescaling the numeric features to a common range (zero mean and unit variance) to remove scale differences between features.

Here is example code for data standardization using the scikit-learn library:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Drop non-numeric columns such as passenger name, cabin number, and ticket number
train_data.drop(['Name', 'Cabin', 'Ticket'], axis=1, inplace=True)
test_data.drop(['Name', 'Cabin', 'Ticket'], axis=1, inplace=True)

# Data standardization
# Separate features and target variable
X_train = train_data.drop('Survived', axis=1)
y_train = train_data['Survived']

# Build Pipelines for feature encoding and data standardization
# Transformer: encode the categorical columns
categorical_features = ['Sex', 'Embarked']
categorical_transformer = Pipeline(steps=[
    ('encoder', OneHotEncoder())])

# Transformer: standardize the numeric columns
numeric_features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'FamilySize']
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

# Combine the two transformers with a ColumnTransformer
# (columns not listed, such as PassengerId, are dropped by default)
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features),
        ('num', numeric_transformer, numeric_features)
    ])

# Fit the preprocessor on the training features and transform them
X_train_scaled = preprocessor.fit_transform(X_train)

# Transform the test set with the fitted preprocessor
X_test_scaled = preprocessor.transform(test_data)

# Print the standardized training data
print(X_train_scaled)

In the code above, a ColumnTransformer combines the feature-encoding and standardization transformers, each wrapped in a Pipeline. First, a one-hot encoding transformer (categorical_transformer) is defined for the categorical columns and a standardization transformer (numeric_transformer) for the numeric columns. The ColumnTransformer then combines the two and specifies which columns each applies to. Finally, fit_transform fits the preprocessor on the training features and transforms them, while transform applies the fitted preprocessing to the test set. The split into training and validation sets is deferred to Section 3.4.
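To see which columns the ColumnTransformer actually produces (the one-hot columns first, then the scaled numeric columns, following the transformer order above), get_feature_names_out() can help. This sketch assumes a scikit-learn version recent enough (>= 1.1) for pipelines to support it:

# Names of the columns in the transformed feature matrix
print(preprocessor.get_feature_names_out())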

3.4 Dataset division

Before training, we split the training dataset into a training subset and a validation subset so that the model's performance can be evaluated and tuned. The validation set is then available for model selection and parameter tuning during training.

The following is example code for dataset partitioning using the scikit-learn library:

from sklearn.model_selection import train_test_split

# Separate the feature variables from the target variable
X = train_data.drop('Survived', axis=1)
y = train_data['Survived']

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Transform the training split with the preprocessor fitted in Section 3.3
X_train_scaled = preprocessor.transform(X_train)

This code splits the training dataset (train.csv) into a training set (X_train and y_train) and a validation set (X_val and y_val) for model training and performance evaluation, and re-applies the fitted preprocessor to the training split so that the model is trained on standardized features.

4. Data analysis

4.1 Exploratory data analysis

4.1.1 Passenger Survival Rate Analysis

This section compares the number of survivors and non-survivors using a bar chart. First, value_counts() counts the survivors and non-survivors in the 'Survived' column. Then plot(kind='bar') draws a bar chart with survival status on the x-axis and the number of passengers on the y-axis, visually showing the comparison between the two groups.

The code corresponding to this part is as follows:

import matplotlib.pyplot as plt
import seaborn as sns

# Compare the number of survivors and non-survivors
survived_count = train_data['Survived'].value_counts()
survived_count.plot(kind='bar', rot=0)
plt.xlabel('Survived')
plt.ylabel('Count')
plt.title('Survived vs. Count')
plt.show()

The corresponding result is shown in the figure below:
Figure 4-1 Bar chart of the number of survivors and non-survivors

4.1.2 Analysis of the relationship between different characteristics and survival rate

This section analyzes and visualizes the relationship between sex and survival and between cabin class and survival, in order to observe the influence of these characteristics on the survival rate. The sns.barplot() function draws a bar chart from the training set, with the feature on the x-axis and the survival rate on the y-axis. plt.xlabel(), plt.ylabel(), and plt.title() set the axis labels and chart title, and plt.show() displays the plot.

The corresponding code is as follows:

# Relationship between sex and survival rate
sns.barplot(x='Sex', y='Survived', data=train_data)
plt.xlabel('Sex')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Sex')
plt.show()

# Relationship between cabin class and survival rate
sns.barplot(x='Pclass', y='Survived', data=train_data)
plt.xlabel('Pclass')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Pclass')
plt.show()

The corresponding results are shown in the figures below:

Figure 4-2 Bar chart of the relationship between sex and survival rate

Figure 4-3 Bar chart of the relationship between cabin class and survival rate

The .corr() method computes the correlation coefficient matrix between the numeric features, and a heat map visualizes it, making it easy to observe the strength of the correlations between features.

# Correlation coefficients between numeric features
# (numeric_only=True, since 'Embarked' is still a string column here)
correlation = train_data.corr(numeric_only=True)

# Plot the feature correlation heat map
plt.figure(figsize=(10, 8))
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

The corresponding result is shown in the figure below:

Figure 4-4 Feature correlation heat map

4.2 Machine Learning Algorithm Prediction

Algorithms such as logistic regression, decision tree, and random forest can be selected for model training and prediction. The training data are divided into a training set and a validation set for model training and cross-validation evaluation, and evaluation metrics (such as accuracy, precision, recall, and F1-score) are used to assess the performance of the model.
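As a sketch of that workflow (not part of the original pipeline), cross_val_score can compare several candidate classifiers on the preprocessed training data before committing to one. X_train_scaled and y_train are assumed from Section 3; the candidate list and fold count are illustrative:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

candidates = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'DecisionTree': DecisionTreeClassifier(random_state=42),
    'RandomForest': RandomForestClassifier(random_state=42),
}

# 5-fold cross-validated accuracy for each candidate model
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_train_scaled, y_train, cv=5, scoring='accuracy')
    print(f"{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")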

4.2.1 Model selection and training

Before making predictions with a machine learning algorithm, we need to choose an appropriate model. Depending on the nature of the problem and the characteristics of the dataset, commonly used models include logistic regression, support vector machines, decision trees, and random forests. In this case, we chose the random forest classifier (RandomForestClassifier) as the predictive model.

The following is a code example for model selection and training:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Instantiate the random forest classifier
# (the feature columns are already defined by the preprocessor in Section 3.3;
#  random_state is fixed for reproducibility)
model = RandomForestClassifier(random_state=42)

# Train the model on the standardized training split
model.fit(X_train_scaled, y_train)

4.2.2 Model evaluation and optimization

After training the model, we need to evaluate and optimize it. Commonly used evaluation metrics include accuracy, precision, recall, and the F1 score.

The corresponding code is as follows:

# Transform the validation set with the fitted preprocessor
X_val_scaled = preprocessor.transform(X_val)

# Predict on the validation set
val_predictions = model.predict(X_val_scaled)

# Model evaluation
accuracy = accuracy_score(y_val, val_predictions)
precision = precision_score(y_val, val_predictions)
recall = recall_score(y_val, val_predictions)
f1 = f1_score(y_val, val_predictions)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

# Predict on the test set
predictions = model.predict(X_test_scaled)

# Add the predictions to the test set
test_data['Survived'] = predictions

# Save the predictions as a CSV file
test_data.to_csv('predictions.csv', index=False)

The code above makes predictions on the transformed validation set and computes the model's performance metrics there. The trained model is then used to predict on the test set, the predictions are added to the test set as a 'Survived' column, and the resulting table is saved as a CSV file.
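One caveat: the Kaggle competition expects a submission containing exactly two columns, PassengerId and Survived. If the goal is to submit these predictions, a small sketch like the following keeps only those columns (assuming test_data still contains PassengerId):

# Kaggle submission format: PassengerId and the predicted Survived label only
submission = test_data[['PassengerId', 'Survived']]
submission.to_csv('submission.csv', index=False)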

5. Data visualization

5.1 Survivors vs. non-survivors

This part first counts the numbers of passengers predicted to survive and not to survive in the test set and visualizes them with a bar chart and a pie chart. Then, a bar chart keyed on the PassengerId values of the predicted survivors shows their distribution across the test set.

Here is a code example comparing the number of predicted survivors and non-survivors:

# Count the predicted survivors and non-survivors
survived_count = test_data['Survived'].value_counts()

# Bar chart of survivor vs. non-survivor counts
plt.figure(figsize=(6, 4))
survived_count.plot(kind='bar', color=['lightblue', 'lightgreen'])
plt.xlabel('Survived')
plt.ylabel('Count')
plt.title('Number of Survived Passengers')
plt.xticks(rotation=0)
plt.show()

The corresponding result is shown in the figure below:

Figure 5-1 Bar chart of surviving and non-surviving counts

# Pie chart of the survival vs. non-survival proportions
plt.figure(figsize=(4, 4))
plt.pie(survived_count, labels=survived_count.index, autopct='%1.1f%%', startangle=90, colors=['lightblue', 'lightgreen'])
plt.title('Survival Rate of Passengers')
plt.axis('equal')
plt.show()

The corresponding result is shown in the figure below:

Figure 5-2 Pie chart of survival and non-survival proportions

# Select the passengers predicted to survive in the test set
# and visualize them by PassengerId
survived_passengers = test_data[test_data['Survived'] == 1]
survived_passenger_ids = survived_passengers['PassengerId']

# Bar chart of the surviving passengers
plt.figure(figsize=(12, 6))
plt.bar(survived_passenger_ids, [1] * len(survived_passenger_ids), color='green')
plt.xlabel('PassengerId')
plt.ylabel('Survived')
plt.title('Survival of Passengers')
plt.xticks(rotation=90)
plt.show()

The corresponding result is shown in the figure below:

Figure 5-3 Bar chart of the surviving passengers

For the surviving-passenger bar chart, plt.figure(figsize=(12, 6)) creates a figure window of the given size. plt.bar(survived_passenger_ids, [1] * len(survived_passenger_ids), color='green') draws a bar chart with the surviving passengers' PassengerId values on the x-axis and a constant value of 1 on the y-axis, so each passenger predicted to survive is marked by a green bar.

5.2 Visualization of the relationship between different features and survival rate

This section visualizes the relationship between different features and survival. Bar charts and box plots show how each characteristic relates to the survival rate, helping to analyze the impact of each feature on passenger survival. (Note that for the test set, 'Survived' holds the model's predictions.)

Here is sample code for visualizing the relationship between different features and survival:

# Data visualization
# Relationships between different features and the survival rate
plt.figure(figsize=(12, 6))

# Sex vs. survival rate
plt.subplot(2, 3, 1)
sns.barplot(x='Sex', y='Survived', data=test_data)
plt.xlabel('Sex')
plt.ylabel('Survival Rate')

# Cabin class vs. survival rate
plt.subplot(2, 3, 2)
sns.barplot(x='Pclass', y='Survived', data=test_data)
plt.xlabel('Pclass')
plt.ylabel('Survival Rate')

# Age vs. survival
plt.subplot(2, 3, 3)
sns.boxplot(x='Survived', y='Age', data=test_data)
plt.xlabel('Survived')
plt.ylabel('Age')

# Family size vs. survival rate
plt.subplot(2, 3, 4)
sns.barplot(x='FamilySize', y='Survived', data=test_data)
plt.xlabel('Family Size')
plt.ylabel('Survival Rate')

# Fare vs. survival
plt.subplot(2, 3, 5)
sns.boxplot(x='Survived', y='Fare', data=test_data)
plt.xlabel('Survived')
plt.ylabel('Fare')
plt.tight_layout()
plt.show()

The corresponding result is shown in the figure below:

Figure 5-4 Visualization of the relationship between different features and survival rates

6. Data Analysis Conclusions and Suggestions

6.1 Summary of conclusions

This chapter summarizes the data analysis of the Titanic dataset and makes recommendations based on the results. We first review the relationships between different characteristics and passenger survival, and then offer suggestions for the predictive model.

6.2 Analysis of the relationship between characteristics and survival rate

Based on the data summary and statistical analysis, we can draw the following conclusions:

There is a correlation between cabin class (Pclass) and the survival rate: first-class passengers survived at a higher rate, while third-class passengers survived at a lower rate.

Sex is strongly associated with survival, with a higher survival rate for females and a lower one for males.

Age has a weak correlation with the survival rate; survival rates differ little across age groups.

The relationship between the number of siblings/spouses aboard (SibSp) and parents/children aboard (Parch) and the survival rate is complex, with no obvious linear correlation.

There is some correlation between fare (Fare) and the survival rate: passengers who paid higher fares survived at a higher rate.

The relationship between family size (FamilySize) and the survival rate is also complex; passengers with a moderate number of family members survived at a higher rate. These group-level rates can be verified directly, as sketched below.
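These conclusions can be spot-checked with groupby aggregations on the training set. A minimal sketch, assuming train_data still holds the numeric 0/1 'Survived' column and the encoded features from Section 3:

# Survival rate per group: the mean of a 0/1 column is the share of survivors
print(train_data.groupby('Pclass')['Survived'].mean())
print(train_data.groupby('Sex')['Survived'].mean())      # Sex was label-encoded earlier
print(train_data.groupby('FamilySize')['Survived'].mean())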

6.3 Recommendations based on analysis results

Cabin class (Pclass) is an important feature that should be incorporated into predictive models. The higher survival rate of first-class passengers may be related to better access to rescue resources in an emergency.

Sex is another important characteristic, with females surviving at a higher rate; it should therefore be treated as a key feature in predictive models.

Although age (Age) is only weakly correlated with the survival rate, it can still be included in the prediction model to explore possible age-related patterns.

Fare, as a feature related to the survival rate, can be used as an input to the predictive model. Passengers with higher fares may have better access to rescue resources in an emergency.

Family size (FamilySize) can also be used as a feature to consider the impact of the number of family members on passenger survival. However, attention needs to be paid to the complex relationship between family size and survival to avoid over-interpreting the data.

7. Summary

7.1 Research Summary

This paper presented a detailed data analysis and prediction study on the Titanic survival dataset. First, an exploratory analysis was performed, covering the basic statistical characteristics of the data and the treatment of missing values. Then, feature engineering filled in and transformed the data and created new features. Next, the random forest algorithm was selected for model training and prediction, and the model's performance was evaluated. Visual analysis gave insight into the relationship between different characteristics and survival. Finally, the analysis results were summarized and corresponding suggestions were put forward.

7.2 Limitations and directions for further research

Through this analysis, we gained a better understanding of the Titanic survival dataset and obtained a predictive model for passenger survival. These analyses and results can provide useful references for rescue planning, resource allocation, and decision support, and they demonstrate the importance and application value of data analysis in solving practical problems. In future research and practice, the model can be further improved, for example through hyperparameter tuning, more features and algorithms can be explored, and data analysis can be applied to a wider range of fields and practical scenarios to support decision-making and problem solving.
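As one concrete direction for such improvement, the random forest's hyperparameters could be tuned with a grid search. A sketch under the same assumptions as before (X_train_scaled and y_train from Sections 3.3-3.4); the parameter grid is illustrative, not tuned:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Illustrative search space for the random forest
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train_scaled, y_train)
print("Best params:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)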

Origin blog.csdn.net/weixin_51735748/article/details/131668345