Machine Learning on the Titanic Data Set (1): Data Mining

Data mining process:

(A) Reading the data:

  • Read the data and display it
  • Compute summary statistics
  • Establish the size of the data set and the task to be completed

(B) Feature analysis

  • Single-feature analysis: examine how each variable affects the outcome on its own
  • Multivariate analysis: consider the combined effect of several variables
  • Draw conclusions from statistical plots

(C) Data cleaning and preprocessing

  • Fill in missing values
  • Standardize / normalize features
  • Select the valuable features
  • Analyze the correlation between features

(D) Modeling

  • Prepare the labels and the feature data
  • Split the data set
  • Compare several modeling algorithms
  • Improve the result with ensemble strategies
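
As a preview of step (C), here is a minimal sketch of filling missing values and standardizing a numeric column with pandas. The small `toy` frame and its values are made up for illustration, and median/mode imputation followed by a z-score is only one common choice, not necessarily the approach taken later in this series.

```python
import pandas as pd

# Made-up rows that mimic three Titanic columns (not real passenger data)
toy = pd.DataFrame({
    'Age': [22.0, None, 38.0, 26.0],
    'Embarked': ['S', 'C', None, 'S'],
    'Fare': [7.25, 71.28, 8.05, 53.10],
})

# Fill numeric gaps with the median and categorical gaps with the most frequent value
toy['Age'] = toy['Age'].fillna(toy['Age'].median())
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

# Standardize Fare to zero mean and unit (sample) standard deviation
toy['Fare_std'] = (toy['Fare'] - toy['Fare'].mean()) / toy['Fare'].std()

print(toy)
```

The median is less sensitive to outliers than the mean, which is why it is often preferred for a skewed column such as age or fare.
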
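import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')   # plot style used for all figures
warnings.filterwarnings('ignore')  # hide library warnings in the notebook
%matplotlib inline
# Load the training set and preview the first rows
data = pd.read_csv('train.csv')
data.head()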

Check whether there are missing values:

data.isnull().sum()

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
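
Raw counts are easier to judge as a share of the data set. The sketch below converts the three non-zero counts from the output above into percentages of the 891-row training set; the variable names are illustrative. On the real frame, `data.isnull().mean()` gives the same fractions directly.

```python
import pandas as pd

# Non-zero missing-value counts copied from the isnull().sum() output
missing = pd.Series({'Age': 177, 'Cabin': 687, 'Embarked': 2})
n_rows = 891  # number of passengers in the training set

# Percentage of rows with a missing value in each column
missing_pct = (missing / n_rows * 100).round(1)
print(missing_pct)
```
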

data.describe()

View the overall survival ratio:

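f, ax = plt.subplots(1, 2, figsize=(18, 8))
# Left: share of each outcome as a pie chart
data['Survived'].value_counts().plot.pie(explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=True, colors=sns.color_palette(palette='cool'))
ax[0].set_title('Survived')
ax[0].set_ylabel('')
# Right: raw counts; pass x by keyword, since recent seaborn versions reject the positional form
sns.countplot(x='Survived', data=data, ax=ax[1])
ax[1].set_title('Survived')
plt.show()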


Of the 891 passengers in the training set, only 342 survived, which is 38.4% of the total. We need to dig further into the data to see which categories of passengers survived and which did not.

We will use the different features of the data set, such as sex, age, and port of embarkation, to examine survival.

The features fall into two groups, discrete values and continuous values:

  • Discrete values: sex (male, female), port of embarkation (S, Q, C)

  • Continuous values: age, fare
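
One way to make this split in code is to separate the columns by dtype. The sketch below uses a small made-up frame. On the real data, integer-coded columns such as Pclass would be reported as numeric even though they are discrete, so treat the result as a starting point rather than a final answer.

```python
import pandas as pd

# Made-up sample containing the four features named above
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'S', 'Q'],
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# Numeric columns are candidates for continuous features, the rest are discrete
continuous = toy.select_dtypes(include='number').columns.tolist()
discrete = toy.select_dtypes(exclude='number').columns.tolist()

print('discrete:', discrete)
print('continuous:', continuous)
```
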

data.groupby(['Sex','Survived'])['Survived'].count()
Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64
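f, ax = plt.subplots(1, 2, figsize=(18, 8))
# Left: mean of Survived per sex, i.e. the survival rate (the keyword is color, not colors)
data[['Sex', 'Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0], color='c')
ax[0].set_title('Survived vs Sex')
# Right: counts of survivors and non-survivors per sex
sns.countplot(x='Sex', hue='Survived', data=data, ax=ax[1])
ax[1].set_title('Sex:Survived vs Dead')
plt.show()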


There were far more men than women on board, yet more than twice as many women as men were saved. The survival rate for women was about 74%, while for men it was about 19%.
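
The survival rates by sex can be recomputed from the groupby counts shown earlier. This is a small sketch that rebuilds those four counts by hand; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the groupby(['Sex', 'Survived']) output
counts = pd.Series(
    [81, 233, 468, 109],
    index=pd.MultiIndex.from_tuples(
        [('female', 0), ('female', 1), ('male', 0), ('male', 1)],
        names=['Sex', 'Survived'],
    ),
)

table = counts.unstack('Survived')          # rows: Sex, columns: 0 (died) / 1 (survived)
rate_by_sex = table[1] / table.sum(axis=1)  # survivors divided by passengers of that sex
overall_rate = table[1].sum() / counts.sum()

print(rate_by_sex.round(3))
print(round(overall_rate, 3))
```

On the full frame, `data.groupby('Sex')['Survived'].mean()` should give the same per-sex rates in one line, because Survived is coded as 0/1.
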

  • Pclass -> relationship between ticket class and survival
pd.crosstab(data.Pclass,data.Survived,margins=True).style.background_gradient(cmap='spring')


f, ax = plt.subplots(1, 2, figsize=(18, 8))
# Left: number of passengers in each class (the keyword is color, not colors)
data['Pclass'].value_counts().sort_index().plot.bar(color=sns.color_palette(palette='hls'), ax=ax[0])
ax[0].set_title('Number Of Passengers By Pclass')
ax[0].set_ylabel('Count')
# Right: survivors and non-survivors within each class
sns.countplot(x='Pclass', hue='Survived', data=data, ax=ax[1])
ax[1].set_title('Pclass:Survived vs Dead')
plt.show()


data['Pclass'].value_counts()
3    491
1    216
2    184
Name: Pclass, dtype: int64

Passengers in Pclass 1 were clearly given high priority in the rescue. Although Pclass 3 had by far the most passengers, its survival rate was very low, about 25%.

The survival rate was about 63% for Pclass 1 and about 47% for Pclass 2.
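
Rates like these can be read straight from a crosstab by normalizing each row. The sketch below uses a handful of made-up passengers only to show the call; on the real data the same idea would be `pd.crosstab(data.Pclass, data.Survived, normalize='index')`.

```python
import pandas as pd

# Made-up passengers, only to demonstrate the call (not real Titanic rows)
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# normalize='index' divides every row by its row total,
# so column 1 holds the survival rate of each class
rates = pd.crosstab(toy['Pclass'], toy['Survived'], normalize='index')
print(rates)
```
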

  • Combined effect of ticket class and sex on survival
pd.crosstab([data.Sex,data.Survived],data.Pclass,margins=True).style.background_gradient(cmap='summer_r')


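# factorplot was renamed to catplot in newer seaborn; kind='point' reproduces its old default
sns.catplot(x='Pclass', y='Survived', hue='Sex', kind='point', data=data)
plt.show()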


The point plot makes this relationship easier to read.

We can easily infer that the survival rate for women in Pclass 1 was about 97%, since only 3 of the 94 women in Pclass 1 did not survive.

Clearly, women were given priority in the rescue regardless of Pclass.
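
The plot draws the mean of Survived for each combination of Pclass and Sex, and the same numbers can be tabulated with a groupby. The sketch below runs on made-up rows, so its output says nothing about the real passengers; adding the group size next to the mean shows how many passengers each rate rests on.

```python
import pandas as pd

# Made-up rows, only to demonstrate the aggregation (not real Titanic data)
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'male', 'male', 'female', 'male', 'male', 'female'],
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# mean of a 0/1 column is the survival rate; count is the group size
summary = toy.groupby(['Pclass', 'Sex'])['Survived'].agg(['mean', 'count'])
print(summary)
```
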

It seems Pclass is also an important feature. Let's examine the other features.

Origin blog.csdn.net/weixin_44727383/article/details/105052655