The Titanic data set is a classic in data mining. This article walks through the notebook that ranks No. 1 for it on Kaggle.
Original notebook address: https://www.kaggle.com/startupsci/titanic-data-science-solutions
ranking
Take a look at the ranking of this notebook:
The gap between first and second place is small, and the second-place notebook has far more comments than the first; when time allows, we will study the second-place approach as well.
Having worked through the first-place source code as a whole, I found the early-stage field processing very detailed and comprehensive, while the modeling is somewhat shallow.
data exploration
import library
Import the three types of libraries needed throughout the process:
- data processing
- visualization
- modeling
# Data processing
import pandas as pd
import numpy as np
import random as rnd
# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
Import Data
After importing the data, check the size of the training and test sets.
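The import step can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the rows below are invented stand-ins so the snippet runs on its own, and with the real Kaggle files you would presumably call `pd.read_csv("train.csv")` and `pd.read_csv("test.csv")` instead (those paths are assumptions).

```python
import io
import pandas as pd

# Invented stand-in rows in the layout of the Kaggle files (assumption);
# replace the StringIO objects with the real CSV paths in practice.
train_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Doe, Mr. John",male,22,1,0,A1,7.25,,S\n'
    '2,1,1,"Roe, Mrs. Jane",female,38,1,0,B2,71.28,C85,C\n'
    '3,1,3,"Poe, Miss. Ann",female,26,0,0,C3,7.92,,S\n'
)
test_csv = io.StringIO(
    "PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '892,3,"Moe, Mr. Jim",male,34.5,0,0,D4,7.83,,Q\n'
    '893,3,"Loe, Mrs. Kay",female,47,1,0,E5,7.0,,S\n'
)

train = pd.read_csv(train_csv)
test = pd.read_csv(test_csv)

# The later loops iterate over both sets through this list
combine = [train, test]

# (rows, columns) of each set; the test set has no Survived column
print(train.shape, test.shape)
```

Keeping both frames in `combine` lets every later transformation be applied to the training and test sets in one loop.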
field information
View all fields:
train.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
The fields have the following meanings:
- PassengerId: passenger ID
- Survived: whether the passenger survived, 0 = no, 1 = yes
- Pclass: ticket class, 1 = first class, 2 = second class, 3 = third class
- Name: passenger name
- Sex: gender
- Age: age
- SibSp: number of siblings/spouses on board
- Parch: number of parents/children on board
- Ticket: ticket number
- Fare: fare
- Cabin: cabin number
- Embarked: port of embarkation
field classification
There are two main types of data in this case:
- Categorical: Survived, Sex, and Embarked (nominal); Pclass (ordinal)
- Numerical: Age and Fare (continuous); SibSp and Parch (discrete)
missing value
Check the missing values in the training and test sets:
The info function also gives a basic summary of the data, including the non-null count of each field:
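Both checks can be sketched on a small frame. The four rows below are invented for illustration; on the real data the same two calls would be applied to `train` and `test`.

```python
import numpy as np
import pandas as pd

# Invented sample with a few gaps (assumption, not the real data)
train = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

# Number of missing values per field
missing = train.isnull().sum()
print(missing)

# Column types and non-null counts in one summary
train.info()
```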
data assumptions
Based on this basic information and common sense, the notebook's author lays out some assumptions and the direction of the subsequent processing and analysis:
delete field
- The project mainly examines the relationship between the other fields and the Survived field
- Fields to focus on: Age, Embarked
- Fields to delete because they contribute little to the analysis: Ticket (ticket number), Cabin (cabin number), PassengerId (passenger ID), Name (name)
Modify and add fields
- Add Family: built from Parch (the number of parents/children on board) and SibSp (the number of siblings/spouses on board)
- Extract Title from the Name field as a new feature
- Convert the Age field into an ordinal categorical feature
- Create a fare-range feature based on Fare
Conjectures
- Women (Sex=female) are more likely to survive
- Children (Age<?) are more likely to survive
- Passengers in a higher class (Pclass=1) are more likely to survive
Statistical Analysis
We mainly analyze the categorical variable Sex, the ordinal variable Pclass, and the discrete variables SibSp and Parch to verify our conjectures.
1. Cabin class (1-first class, 2-second class, 3-third class)
Conclusion: First-class passengers are more likely to survive
2. Gender
Conclusion: Women are more likely to survive
3. Number of siblings/spouses
Conclusion: Passengers with few siblings or spouses on board are more likely to survive
4. Number of parents/children
Conclusion: Passengers with 3 parents/children on board (Parch=3) are more likely to survive
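Each of the four conclusions above comes from the mean of Survived within each value of a field. A minimal sketch of that computation, using an invented six-row frame and a hypothetical helper named `survival_rate` (neither is from the notebook):

```python
import pandas as pd

# Invented sample (assumption); the real analysis uses the full training set
train = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0],
    "Pclass":   [1, 3, 1, 3, 2, 2],
    "Sex":      ["female", "male", "female", "male", "female", "male"],
    "SibSp":    [0, 1, 0, 1, 0, 0],
    "Parch":    [0, 0, 1, 0, 1, 0],
})

def survival_rate(df, field):
    """Mean of Survived per value of `field`, highest rate first."""
    return (df[[field, "Survived"]]
            .groupby(field, as_index=False)
            .mean()
            .sort_values(by="Survived", ascending=False))

# One small table per field under study
rates = {f: survival_rate(train, f) for f in ["Pclass", "Sex", "SibSp", "Parch"]}
for field, table in rates.items():
    print(table, end="\n\n")
```

Because Survived is coded 0/1, its mean within a group is exactly that group's survival rate.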
visual analysis
age and survival
g = sns.FacetGrid(train, col="Survived")
g.map(plt.hist, 'Age', bins=20)
plt.show()
- Among those who did not survive, most were between 15 and 25 years old (left panel, Survived=0)
- The oldest survivor was 80, and children under the age of 4 had a very high survival rate (right panel, Survived=1)
- Most passengers were between 15 and 35 years old (both panels)
Cabin class and survival
grid = sns.FacetGrid(
    train,
    col="Survived",
    row="Pclass",
    height=2.2,  # "size" was renamed to "height" in seaborn 0.9
    aspect=1.6
)
grid.map(plt.hist, "Age", alpha=0.5, bins=20)
grid.add_legend()
plt.show()
- Class 3 had the most passengers, but many of them did not survive
- Class 1 had the most survivors
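Embarkation location, gender, and survival
grid = sns.FacetGrid(train,
                     row="Embarked",
                     height=2.2,  # "size" was renamed to "height" in seaborn 0.9
                     aspect=1.6)
# Keyword arguments keep this working on seaborn versions where x, y and hue are keyword-only;
# fixed orders give every panel the same layout and colors
grid.map_dataframe(sns.pointplot,
                   x="Pclass",
                   y="Survived",
                   hue="Sex",
                   order=[1, 2, 3],
                   hue_order=["male", "female"],
                   palette="deep")
grid.add_legend()
plt.show()
- Women generally survive at a higher rate than men
- The exception is Embarked=C, where males have a higher survival rate
- For Pclass=3, males who embarked at C survive at a higher rate than those who embarked at Q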
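Fares, Classes and Survival
grid = sns.FacetGrid(train,
                     row='Embarked',
                     col='Survived',
                     height=2.2, aspect=1.6)  # "size" was renamed to "height" in seaborn 0.9
# Keyword arguments keep this working on seaborn versions where x and y are keyword-only;
# newer seaborn spells ci=None as errorbar=None
grid.map_dataframe(sns.barplot,
                   x='Sex',
                   y='Fare',
                   order=['male', 'female'],
                   alpha=.5, ci=None)
grid.add_legend()
plt.show()
- Passengers who paid higher fares survived at a higher rate (compare the right-hand column, Survived=1, with the left)
- The survival rate is related to the port of embarkation; it is clearly best for Embarked=C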
The above is based on simple statistics and visual analysis. What follows applies various machine learning models, which first requires a good deal of preprocessing and feature engineering.
remove invalid fields
The ticket number Ticket and the cabin number Cabin are almost useless for our analysis, so we can delete them directly:
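Dropping the two fields could look like the sketch below. The two-row frames are invented for illustration; only the `drop` calls reflect the step described above.

```python
import pandas as pd

# Invented stand-ins for the training and test sets (assumption)
train = pd.DataFrame({
    "Survived": [0, 1],
    "Ticket": ["A1", "B2"],
    "Cabin": [None, "C85"],
    "Fare": [7.25, 71.28],
})
test = pd.DataFrame({
    "Ticket": ["D4", "E5"],
    "Cabin": [None, None],
    "Fare": [7.83, 7.0],
})

print("Before:", train.shape, test.shape)

# Drop the same columns from both sets so they stay aligned
train = train.drop(["Ticket", "Cabin"], axis=1)
test = test.drop(["Ticket", "Cabin"], axis=1)

# Rebuild the list, because drop returns new DataFrame objects
combine = [train, test]

print("After:", train.shape, test.shape)
```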
generate new features
The main idea is to derive new features from relationships among the existing ones, or to transform existing features.
Field Name processing
Extract titles such as Lady, Dr, and Miss from the name, then check whether the title is related to survival
# Extract the title with a regular expression: the first word followed by a period
for dataset in combine:
    dataset["Title"] = dataset.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
# Count passengers by sex and title
train.groupby(["Sex","Title"]).size().reset_index()
The same statistics as a crosstab:
# Crosstab form
pd.crosstab(train['Title'], train['Sex'])
Consolidate the extracted titles into the common ones plus a single Rare category:
for dataset in combine:
    # Group the uncommon titles under Rare
    dataset["Title"] = dataset["Title"].replace(['Lady', 'Countess', 'Capt', 'Col',
        'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    # Merge spelling variants into the common titles
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
# Mean survival rate for each Title
train[["Title","Survived"]].groupby("Title",as_index=False).mean()
Title is a text field that the models cannot use directly, so we map it to numbers:
title_mapping = {
    "Mr":1,
    "Miss":2,
    "Mrs":3,
    "Master":4,
    "Rare":5
}
for dataset in combine:
    # Map the titles that appear in the dictionary
    dataset['Title'] = dataset['Title'].map(title_mapping)
    # Fill anything else with 0
    dataset['Title'] = dataset['Title'].fillna(0)
train.head()
Now that Title has been extracted, Name can be dropped, along with PassengerId in the training set:
train = train.drop(['Name', 'PassengerId'], axis=1)
test = test.drop(['Name'], axis=1)
combine = [train, test]
train.shape, test.shape
# ((891, 9), (418, 9))
Field Sex
Convert the gender values to numbers: male becomes 0 and female becomes 1
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map({'female': 1, 'male': 0}).astype(int)
Age distribution by Pclass and Sex:
grid = sns.FacetGrid(
    train,
    row='Pclass',
    col='Sex',
    height=2.2,  # "size" was renamed to "height" in seaborn 0.9
    aspect=1.6)
grid.map(plt.hist,
         'Age',
         alpha=.5,
         bins=20)
grid.add_legend()
plt.show()
Field Age
1. The first step is handling the field's missing values.
The Age field has missing values; we fill them with the median age of each of the six combinations of Sex (0, 1) and Pclass (1, 2, 3).
The specific process of filling:
guess_ages = np.zeros((2,3))
for dataset in combine:
    for i in range(0,2):
        for j in range(0,3):
            # Non-missing ages for this Sex/Pclass combination
            guess_df = dataset[(dataset["Sex"] == i) & (dataset["Pclass"] == j+1)]["Age"].dropna()
            age_guess = guess_df.median()  # median
            # Round the median to the nearest 0.5
            guess_ages[i,j] = int(age_guess / 0.5 + 0.5) * 0.5
    for i in range(0,2):
        for j in range(0,3):
            # Fill the missing ages of this combination with its guess
            dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),"Age"] = guess_ages[i,j]
    dataset["Age"] = dataset["Age"].astype(int)
# Age has no missing values after filling
train.isnull().sum()
2. Age division and binning
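The binning step can be sketched with `pd.cut`, which splits a column into equal-width intervals. The ages below are invented and span 0 to 80 so that the band edges land on 16, 32, 48, and 64; the column name AgeBand matches the helper field that is dropped later.

```python
import pandas as pd

# Invented ages and outcomes (assumption, not the real data)
train = pd.DataFrame({
    "Age":      [0, 10, 20, 30, 40, 50, 60, 70, 80],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 0, 1],
})

# Five equal-width age bands
train["AgeBand"] = pd.cut(train["Age"], 5)

# Mean survival rate per band, ordered by band
band_rates = (train[["AgeBand", "Survived"]]
              .groupby("AgeBand", as_index=False, observed=False)
              .mean()
              .sort_values(by="AgeBand", ascending=True))
print(band_rates)
```

The band edges printed here are what motivate replacing Age with an ordinal code per band.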
3. Convert to ordinal numbers
- Ages of 16 or less are replaced by 0
- Ages above 16 and up to 32 are replaced by 1, and so on
for dataset in combine:
    dataset.loc[dataset["Age"] <= 16, "Age"] = 0
    dataset.loc[(dataset["Age"] > 16) & (dataset["Age"] <= 32), "Age"] = 1
    dataset.loc[(dataset["Age"] > 32) & (dataset["Age"] <= 48), "Age"] = 2
    dataset.loc[(dataset["Age"] > 48) & (dataset["Age"] <= 64), "Age"] = 3
    dataset.loc[(dataset["Age"] > 64), "Age"] = 4
# Drop the AgeBand helper field
train = train.drop(["AgeBand"], axis=1)
combine = [train, test]
field handling
Generate new fields based on existing fields:
generate new field 1
First generate a FamilySize field from the two fields Parch and SibSp
for dataset in combine:
    # Siblings/spouses + parents/children + the passenger
    dataset["FamilySize"] = dataset["SibSp"] + dataset["Parch"] + 1
# Mean survival rate for each FamilySize
train[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Derive IsAlone from FamilySize: a passenger whose FamilySize is 1 is travelling alone and gets IsAlone = 1; everyone else gets 0
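A minimal sketch of that rule, on invented frames that already contain FamilySize. The flag is named IsAlone here.

```python
import pandas as pd

# Invented stand-ins (assumption); FamilySize = SibSp + Parch + 1
train = pd.DataFrame({"FamilySize": [1, 2, 1, 4], "Survived": [0, 1, 1, 0]})
test = pd.DataFrame({"FamilySize": [1, 3]})
combine = [train, test]

for dataset in combine:
    # Default: the passenger travels with family
    dataset["IsAlone"] = 0
    # A family of one means the passenger is alone
    dataset.loc[dataset["FamilySize"] == 1, "IsAlone"] = 1

# Mean survival rate for each value of the flag
alone_rates = train[["IsAlone", "Survived"]].groupby("IsAlone", as_index=False).mean()
print(alone_rates)
```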
Finally, delete Parch, SibSp, and FamilySize, and keep only the IsAlone flag:
# Drop Parch, SibSp, and FamilySize; only IsAlone is kept
train = train.drop(['Parch', 'SibSp', 'FamilySize'],axis=1)
test = test.drop(['Parch', 'SibSp', 'FamilySize'],axis=1)
combine = [train, test]
train.head()
generate new field 2
New field 2 is the product of Age and Pclass:
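The product could be computed as in the sketch below. The frames are invented, Age is assumed to hold the ordinal codes 0 to 4 produced earlier, and the column name `Age*Class` is an assumption, since the article does not show it.

```python
import pandas as pd

# Invented stand-ins with Age already binned to 0..4 (assumption)
train = pd.DataFrame({"Age": [1, 2, 0, 3], "Pclass": [3, 1, 2, 3]})
test = pd.DataFrame({"Age": [4, 1], "Pclass": [1, 2]})
combine = [train, test]

for dataset in combine:
    # Interaction feature: age band multiplied by ticket class
    dataset["Age*Class"] = dataset["Age"] * dataset["Pclass"]

print(train[["Age*Class", "Age", "Pclass"]])
```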
Classification of Embarked Fields
The Embarked field takes the values S, Q, and C. First we fill in its missing values.
Check the missing values in this field:
Processing: find the mode, use it to fill the missing values, and view the mean survival rate for each value
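These three steps can be sketched as follows, on invented frames; the variable name `freq_port` is an assumption.

```python
import pandas as pd

# Invented stand-ins with one missing port each (assumption)
train = pd.DataFrame({
    "Embarked": ["S", "C", None, "S", "Q", "S"],
    "Survived": [0, 1, 1, 0, 0, 1],
})
test = pd.DataFrame({"Embarked": ["Q", None, "S"]})
combine = [train, test]

# Most frequent port in the training set
freq_port = train["Embarked"].dropna().mode()[0]

for dataset in combine:
    # Fill the gaps with the most frequent port
    dataset["Embarked"] = dataset["Embarked"].fillna(freq_port)

# Mean survival rate per port, highest first
embarked_rates = (train[["Embarked", "Survived"]]
                  .groupby("Embarked", as_index=False)
                  .mean()
                  .sort_values(by="Survived", ascending=False))
print(embarked_rates)
```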
Convert text type to numeric type:
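A sketch of the conversion with a dictionary and `map`. The frames are invented, and the particular codes (S=0, C=1, Q=2) are an assumption, since the article does not state which numbers are used.

```python
import pandas as pd

# Invented stand-ins with no missing ports left (assumption)
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["Q", "S"]})
combine = [train, test]

# Hypothetical code table; any consistent assignment would work
port_mapping = {"S": 0, "C": 1, "Q": 2}

for dataset in combine:
    # Replace each port letter with its integer code
    dataset["Embarked"] = dataset["Embarked"].map(port_mapping).astype(int)

print(train["Embarked"].tolist())
```

The `astype(int)` call only succeeds once every missing value has been filled, which is why the fill step comes first.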
Fare field processing
There is no missing value in this field in the training set, and there is one in the test set:
Fill it with the median:
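A minimal sketch of the median fill, on an invented test frame with a single missing fare:

```python
import numpy as np
import pandas as pd

# Invented stand-in for the test set (assumption)
test = pd.DataFrame({"Fare": [7.25, np.nan, 8.05, 30.0]})

# Median of the known fares
median_fare = test["Fare"].dropna().median()

# Replace the missing fare with that median
test["Fare"] = test["Fare"].fillna(median_fare)

print(test["Fare"].tolist())
```

The median is less sensitive than the mean to the few very expensive tickets.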
Bin the fares:
# Bin Fare into a FareBand field (training set only)
train['FareBand'] = pd.qcut(train['Fare'], 4)  # 4 groups of roughly equal size
# Mean survival rate for each band
train[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
Convert each band to an ordinal number:
# 4 bands
for dataset in combine:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
# Drop the FareBand helper field
train = train.drop(['FareBand'], axis=1)
combine = [train, test]
test.head()
This way we get the final fields and data for modeling:
modeling
The modeling process follows. We first separate the features from the label:
# Training set
X_train = train.drop("Survived", axis=1)
Y_train = train["Survived"]
# Test set
X_test = test.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape
Each model goes through the same steps:
- Instantiate the model
- Fit it on the training set
- Predict on the test set
- Compute the accuracy on the training set
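The four steps can be wrapped in one small function. This is a sketch under stated assumptions: the helper name `fit_and_score` is hypothetical, the data is an invented toy set in which survival equals Sex, and a decision tree stands in for any of the classifiers.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def fit_and_score(model, X_train, Y_train, X_test):
    """Fit, predict on the test set, and return the training accuracy in percent."""
    model.fit(X_train, Y_train)                            # step 2: fit
    Y_pred = model.predict(X_test)                         # step 3: predict
    acc = round(model.score(X_train, Y_train) * 100, 2)    # step 4: accuracy
    return Y_pred, acc

# Invented toy data (assumption): the label is identical to Sex
X_train = pd.DataFrame({
    "Pclass": [1, 2, 3, 1, 2, 3, 1, 2],
    "Sex":    [1, 0, 1, 0, 1, 0, 1, 0],
})
Y_train = pd.Series([1, 0, 1, 0, 1, 0, 1, 0], name="Survived")
X_test = pd.DataFrame({"Pclass": [3, 1], "Sex": [0, 1]})

# Step 1: instantiate the model, then run the remaining steps
Y_pred, acc = fit_and_score(DecisionTreeClassifier(random_state=0),
                            X_train, Y_train, X_test)
print(Y_pred, acc)
```

Because the score is computed on the data the model was fitted on, it measures fit rather than performance on unseen passengers.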
Model 1: Logistic Regression
# Instantiate the model
logreg = LogisticRegression()
# Fit on the training set
logreg.fit(X_train, Y_train)
# Predict on the test set
Y_pred = logreg.predict(X_test)
# Accuracy on the training set, in percent
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
# Result
81.37
The coefficients of the fitted logistic regression model:
# Logistic regression features and coefficients
coeff_df = pd.DataFrame(train.columns[1:])  # every column except Survived, which comes first
coeff_df.columns = ["Features"]
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])
# Sort from highest to lowest
coeff_df.sort_values(by='Correlation', ascending=False)
Conclusion: Sex is indeed an important factor in survival
Model 2: Support Vector Machine SVM
Model 3: KNN
Model 4: Naive Bayes
Model 5: Perceptron
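Models 2 to 5 follow the same four steps as the logistic regression. The sketch below is an illustration on an invented toy data set, not the notebook's code; `n_neighbors=3` is an assumed setting, and the variable names are chosen to match those used in the model comparison.

```python
import pandas as pd
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron

# Invented toy data (assumption); use the prepared X_train, Y_train, X_test in practice
X_train = pd.DataFrame({
    "Pclass": [1, 2, 3, 1, 2, 3, 1, 2],
    "Sex":    [1, 0, 1, 0, 1, 0, 1, 0],
})
Y_train = pd.Series([1, 0, 1, 0, 1, 0, 1, 0], name="Survived")
X_test = pd.DataFrame({"Pclass": [3, 1], "Sex": [0, 1]})

# Model 2: support vector machine
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)

# Model 3: k-nearest neighbors (n_neighbors=3 is an assumption)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)

# Model 4: Gaussian naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)

# Model 5: perceptron
perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)

print(acc_svc, acc_knn, acc_gaussian, acc_perceptron)
```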
Model 6: Linear Support Vector Classification
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc
# Result
79.46
Model 7: Stochastic Gradient Descent
Model 8: Decision Tree
Model 9: Random Forest
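Models 7 to 9 can be sketched the same way. Again this runs on an invented toy data set rather than the prepared Titanic features; `n_estimators=100` and the fixed `random_state` values are assumptions added for reproducibility, and the variable names match those used in the model comparison.

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Invented toy data (assumption); use the prepared X_train, Y_train, X_test in practice
X_train = pd.DataFrame({
    "Pclass": [1, 2, 3, 1, 2, 3, 1, 2],
    "Sex":    [1, 0, 1, 0, 1, 0, 1, 0],
})
Y_train = pd.Series([1, 0, 1, 0, 1, 0, 1, 0], name="Survived")
X_test = pd.DataFrame({"Pclass": [3, 1], "Sex": [0, 1]})

# Model 7: linear classifier trained with stochastic gradient descent
sgd = SGDClassifier(random_state=0)
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)

# Model 8: decision tree
decision_tree = DecisionTreeClassifier(random_state=0)
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)

# Model 9: random forest (n_estimators=100 is an assumption)
random_forest = RandomForestClassifier(n_estimators=100, random_state=0)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)

print(acc_sgd, acc_decision_tree, acc_random_forest)
```

An unpruned decision tree can memorize consistent training data, so a very high training accuracy for the tree-based models is expected and says little about generalization.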
Model comparison
Compare the accuracy of the 9 models above:
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes', 'Perceptron',
              'Stochastic Gradient Descent', 'Linear SVC',
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log,
              acc_random_forest, acc_gaussian, acc_perceptron,
              acc_sgd, acc_linear_svc, acc_decision_tree]})
# Best score first
models.sort_values(by='Score', ascending=False)
Comparing the results, the decision tree and random forest score highest on this data set, followed by KNN (k-nearest neighbors). Note that these scores are accuracies measured on the training set.