10,000 Words: Analyzing the Titanic Data with Python

The Titanic dataset is a classic data-mining dataset. This article walks through the solution ranked No. 1 on Kaggle.

Original notebook address: https://www.kaggle.com/startupsci/titanic-data-science-solutions


ranking

Take a look at the ranking of this case:

[Image: Kaggle leaderboard for this notebook]

The gap between first and second place is small, and the second-place notebook has far more comments than the first; we will study the second-place approach together another time.

Having worked through the first-place source code myself: the early-stage field processing is detailed and thorough, while the modeling stage is relatively shallow.


data exploration

import library

Import the three types of libraries needed throughout the process:

  • data processing

  • visualization library

  • modeling library

# data processing
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

Import Data

After importing the data, check its size:

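The screenshot is omitted here; a minimal sketch of the loading step, assuming the standard Kaggle train.csv and test.csv files sit in the working directory:

# load the Kaggle Titanic data (file paths are an assumption)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# keep both frames in one list so later transformations apply to train and test alike
combine = [train, test]

train.shape, test.shape
# ((891, 12), (418, 11))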

field information

View all fields:

train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

The following are the specific meanings of the fields:

  • PassengerId: passenger id

  • Survived: whether the passenger survived, 0-no, 1-yes

  • Pclass: ticket class, 1-first class, 2-second class, 3-third class

  • Name: name

  • Sex: sex

  • Age: age

  • SibSp: number of siblings/spouses aboard

  • Parch: number of parents/children aboard

  • Ticket: ticket number

  • Fare: fare

  • Cabin: cabin number

  • Embarked: port of embarkation

field classification

There are two main types of data in this case:

  • Categorical: Survived, Sex, Embarked; ordinal: Pclass

  • Continuous: Age, Fare; discrete: SibSp, Parch

missing values

Check the missing values in the training and test sets:

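The original shows a screenshot; a minimal sketch of the check:

# count missing values per column in both sets
print(train.isnull().sum())
print(test.isnull().sum())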

At the same time, you can also check the basic information of the data through the info function:

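A sketch of that step:

train.info()
print('_' * 40)
test.info()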

data assumptions

Based on the basic information of the data and common sense, the author sets out some assumptions and the direction of the subsequent data processing and analysis:

delete field

  • This project mainly examines the relationship between the other fields and the Survived field

  • Focus on fields: Age, Embarked

  • Fields to drop (they contribute nothing to the analysis): Ticket (ticket number), Cabin (cabin number), PassengerId (passenger id), Name (name)

Modify and add fields

  • Add FamilySize: derived from SibSp (number of siblings/spouses aboard) and Parch (number of parents/children aboard)

  • Extract Title from the Name field as a new feature

  • Convert the Age field into an ordered categorical feature

  • Create a feature based on the Fare Range

guess

  • Women (Sex=female) are more likely to survive

  • Children (Age<?) are more likely to survive

  • Passengers in a higher class (Pclass=1) are more likely to survive

Statistical Analysis

We mainly analyze the categorical variable Sex, the ordinal variable Pclass, and the discrete variables SibSp and Parch to verify the guesses above.
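Each of the four results below was produced by a groupby pivot; a minimal sketch that reproduces all four tables:

# mean survival rate for each candidate feature
for col in ['Pclass', 'Sex', 'SibSp', 'Parch']:
    print(train[[col, 'Survived']]
          .groupby([col], as_index=False)
          .mean()
          .sort_values(by='Survived', ascending=False))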

1. Cabin class (1-first class, 2-second class, 3-third class)

[Table: survival rate by Pclass]

Conclusion: People in first class are more likely to survive

2. Gender

[Table: survival rate by Sex]

Conclusion: Women are more likely to survive

3. Number of siblings/spouse

[Table: survival rate by SibSp]

Conclusion: Passengers with relatively few siblings or spouses are more likely to survive

4. Number of parents/children

[Table: survival rate by Parch]

Conclusion: Passengers with Parch=3 had the highest survival rate

visual analysis

age and survival

g = sns.FacetGrid(train, col="Survived")
g.map(plt.hist, 'Age', bins=20)

plt.show()

[Figure: Age histograms for Survived=0 (left) and Survived=1 (right)]

  1. A large number of passengers aged 15 to 25 did not survive (left panel, Survived=0)

  2. The oldest passenger (Age=80) survived; meanwhile, children aged 4 and under had a very high survival rate (right panel)

  3. Most passengers were between 15 and 35 years old (both panels)

Cabin class and survival

grid = sns.FacetGrid(
    train,
    col="Survived",
    row="Pclass",
    size=2.2,   # note: seaborn >= 0.9 renamed this parameter to height
    aspect=1.6
    )

grid.map(plt.hist,"Age",alpha=0.5,bins=20)
grid.add_legend()
plt.show()

[Figure: Age histograms by Pclass and survival outcome]

  • Pclass=3 had the most passengers, but most of them did not survive

  • Most passengers in Pclass=1 survived

Embarkation location, gender, and survival

grid = sns.FacetGrid(train,
                     row="Embarked",
                     size=2.2,
                     aspect=1.6)
grid.map(sns.pointplot,
         "Pclass",
         "Survived",
         "Sex",
         palette="deep")

grid.add_legend()

plt.show()

[Figure: survival rate vs. Pclass for each Sex, one row per Embarked value]

  1. Women survived at a much higher rate than men

  2. The exception is Embarked=C, where males had the higher survival rate

  3. For Pclass=3 males, the survival rate at Embarked=C was better than at Embarked=Q

Fare, port of embarkation, and survival

grid = sns.FacetGrid(train, 
                     row='Embarked', 
                     col='Survived', 
                     size=2.2, aspect=1.6)

grid.map(sns.barplot, 
         'Sex', 
         'Fare', 
         alpha=.5, ci=None)

grid.add_legend()

plt.show()

[Figure: average Fare by Sex, split by Embarked (rows) and Survived (columns)]

  • Passengers who paid higher fares survived at higher rates (the Survived=1 panels on the right)

  • The survival rate is related to the port of embarkation; Embarked=C is clearly the best

The analysis above relied on simple statistics and visualization. What follows applies several machine-learning modeling methods; before modeling, a great deal of preprocessing and feature-engineering work is done.

remove invalid fields

The ticket number Ticket and the cabin number Cabin are of almost no use for our analysis, so we can drop them directly:

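A minimal sketch of the drop, applied to both sets:

# drop Ticket and Cabin from both train and test
train = train.drop(['Ticket', 'Cabin'], axis=1)
test = test.drop(['Ticket', 'Cabin'], axis=1)
combine = [train, test]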

generate new features

The idea is to derive new features from relationships among the existing attributes, or to apply transformations to existing attributes.

Field Name processing

Extract titles such as Lady, Dr, and Miss from the Name field, and check whether the title is related to survival.

# extract the title with a regular expression
for dataset in combine:
    dataset["Title"] = dataset.Name.str.extract(r'([A-Za-z]+)\.', expand=False)

# count males and females under each Title
train.groupby(["Sex","Title"]).size().reset_index()


Statistics in the form of a crosstab:

# as a crosstab
pd.crosstab(train['Title'], train['Sex'])


Consolidate the extracted titles, keeping the common ones and grouping the rest under Rare:

for dataset in combine:
    dataset["Title"] = dataset["Title"].replace(['Lady', 'Countess', 'Capt', 'Col',
        'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')


# mean survival rate by Title
train[["Title","Survived"]].groupby("Title", as_index=False).mean()


The title itself is text-type and is useless for later modeling, so we directly convert it to a numerical type:

title_mapping = {
    "Mr": 1,
    "Miss": 2,
    "Mrs": 3,
    "Master": 4,
    "Rare": 5
}

for dataset in combine:
    # map titles that exist in the mapping
    dataset['Title'] = dataset['Title'].map(title_mapping)
    # fill unmatched titles with 0
    dataset['Title'] = dataset['Title'].fillna(0)

train.head()

At the same time, some fields need to be deleted:

train = train.drop(['Name', 'PassengerId'], axis=1)
test = test.drop(['Name'], axis=1)

combine = [train, test]
train.shape, test.shape

# ((891, 9), (418, 9))

Field Sex

Convert the Sex field to numeric: male becomes 0 and female becomes 1

for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map({'female': 1, 'male': 0}).astype(int)

The relationship between class, sex, age, and survival:

grid = sns.FacetGrid(
    train,
    row='Pclass',
    col='Sex',
    size=2.2, 
    aspect=1.6)

grid.map(plt.hist, 
         'Age', 
         alpha=.5, 
         bins=20)

grid.add_legend()

plt.show()

[Figure: Age histograms by Pclass (rows) and Sex (columns)]

Field Age

1. First, handle the missing values in this field.

We observe that the Age field has missing values; we fill them using the six combinations of Sex (0, 1) and Pclass (1, 2, 3).

The specific process of filling:

guess_ages = np.zeros((2,3))

for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            # non-missing Age values for this Sex/Pclass combination
            guess_df = dataset[(dataset["Sex"] == i) & (dataset["Pclass"] == j+1)]["Age"].dropna()
            age_guess = guess_df.median()  # median age
            # round to the nearest 0.5
            guess_ages[i,j] = int(age_guess / 0.5 + 0.5) * 0.5
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1), "Age"] = guess_ages[i,j]
    dataset["Age"] = dataset["Age"].astype(int)

# no missing values remain after filling
train.isnull().sum()


2. Age division and binning

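The screenshots show the banding step; a sketch following the original notebook:

# cut Age into 5 equal-width bands and inspect survival per band
train['AgeBand'] = pd.cut(train['Age'], 5)
train[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)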

3. Convert to numerical classification

  • If the age is less than 16, replace it with 0

  • Ages 16 to 32 are replaced by 1, and so on

for dataset in combine:
    dataset.loc[dataset["Age"] <= 16, "Age"] = 0
    dataset.loc[(dataset["Age"] > 16) & (dataset["Age"] <= 32), "Age"] = 1
    dataset.loc[(dataset["Age"] > 32) & (dataset["Age"] <= 48), "Age"] = 2
    dataset.loc[(dataset["Age"] > 48) & (dataset["Age"] <= 64), "Age"] = 3
    dataset.loc[(dataset["Age"] > 64), "Age"] = 4
    
# drop the AgeBand helper field
train = train.drop(["AgeBand"], axis=1)
combine = [train, test]

field handling

Generate new fields based on existing fields:

generate new field 1

First generate a FamilySize field based on the two fields Parch and SibSp

for dataset in combine:
    dataset["FamilySize"] = dataset["SibSp"] + dataset["Parch"] + 1

    
# mean survival rate for each FamilySize
train[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Next, derive IsAlone from FamilySize: if FamilySize equals 1, the passenger is traveling alone and IsAlone is 1; otherwise it is 0. A sketch follows.

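A sketch of the derivation:

# IsAlone = 1 when the passenger has no family aboard
for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

# mean survival rate, alone vs. not alone
train[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()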

Finally, delete Parch, SibSp, and FamilySize, keeping only IsAlone:

# drop Parch, SibSp and FamilySize, keeping only IsAlone
train = train.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test = test.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
combine = [train, test]

train.head()

generate new field 2

The second new field is the product of Age and Pclass:

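A sketch, as in the original notebook:

# interaction feature: age band multiplied by class
for dataset in combine:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass

train.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)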

Classification of Embarked Fields

The Embarked field takes three values: S, Q, and C. First we fill in its missing values.

A check shows that the field does have missing values. The processing: find the mode, fill the missing values with it, and look at the mean survival rate for each value. Finally, convert the text values to numeric ones.
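The three screenshots are omitted; a sketch of these steps, following the original notebook (the S/C/Q integer codes are the notebook's convention):

# fill missing Embarked values with the most frequent port
freq_port = train.Embarked.dropna().mode()[0]

for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)

# mean survival rate per port
train[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

# convert the text values to numeric codes
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)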

Fare field processing

The training set has no missing values in this field; the test set has one. We pad it with the median.
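A sketch of the padding step:

# fill the single missing Fare in the test set with the median
test['Fare'] = test['Fare'].fillna(test['Fare'].dropna().median())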

Carry out binning operations:

# bin Fare into 4 quantile-based groups (FareBand)
train['FareBand'] = pd.qcut(train['Fare'], 4)  # split into 4 groups

# mean survival rate per fare band
train[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)


Convert each segment to numeric data:

# the 4 fare segments
for dataset in combine:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

# drop the FareBand helper field
train = train.drop(['FareBand'], axis=1)
combine = [train, test]
    
test.head()

This way we get the final fields and data for modeling:

[Table: train.head() showing the final feature set]

modeling

The following is the specific modeling process; we first split the data set:

# training set
X_train = train.drop("Survived", axis=1)
Y_train = train["Survived"]

# test set
X_test  = test.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape

The process for each model:

  1. Instantiate the model

  2. Fit it on the training set

  3. Predict on the test set

  4. Compute the accuracy (measured on the training set)

Model 1: Logistic Regression

# instantiate the model
logreg = LogisticRegression()
# fit on the training set
logreg.fit(X_train, Y_train)

# predict on the test set
Y_pred = logreg.predict(X_test)
# training-set accuracy
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log

# result
81.37

The coefficients obtained for the logistic regression model:

# logistic regression features and coefficients
coeff_df = pd.DataFrame(train.columns[1:])  # all feature columns except Survived
coeff_df.columns = ["Features"]

coeff_df["Correlation"] = pd.Series(logreg.coef_[0])

# from highest to lowest
coeff_df.sort_values(by='Correlation', ascending=False)

[Table: features sorted by logistic-regression coefficient]

Conclusion: Sex has the largest positive coefficient; gender really is a decisive factor for survival

Model 2: Support Vector Machine SVM

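The original shows only a screenshot here; a sketch following the same instantiate/fit/predict/score pattern as the other models:

svc = SVC()
svc.fit(X_train, Y_train)

Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc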

Model 3: KNN

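Again only a screenshot in the original; a sketch (the original notebook uses n_neighbors=3):

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)

Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn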

Model 4: Naive Bayes

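A sketch of the Gaussian naive Bayes step:

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)

Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian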

Model 5: Perceptron

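A sketch of the perceptron step:

perceptron = Perceptron()
perceptron.fit(X_train, Y_train)

Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron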

Model 6: Linear Support Vector Classification

linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)

Y_pred = linear_svc.predict(X_test)

acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc
# result
79.46

Model 7: Stochastic Gradient Descent

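A sketch of the SGD classifier step:

sgd = SGDClassifier()
sgd.fit(X_train, Y_train)

Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd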

Model 8: Decision Tree

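A sketch of the decision-tree step:

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)

Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree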

Model 9: Random Forest

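A sketch of the random-forest step (the original notebook uses n_estimators=100):

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest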

Model comparison

Compare the results (accuracy) of the above 9 models:

models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Descent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})

models.sort_values(by='Score', ascending=False)

[Table: the nine models sorted by Score]

Comparing the results: the decision tree and random forest perform best on this dataset, followed by KNN (K nearest neighbors).


Origin blog.csdn.net/m0_59596937/article/details/130035518