Data mining process:
(A) Reading the data
- Read and display the data
- Compute summary statistics
- Determine the size of the data set and the task to complete
(B) Feature analysis
- Univariate analysis: how each individual variable affects the outcome
- Multivariate analysis: how combinations of variables affect the outcome
- Draw conclusions from statistical plots
(C) Data cleaning and preprocessing
- Fill in missing values
- Standardize / normalize features
- Select valuable features
- Analyze correlations between features
(D) Modeling
- Prepare the labels and feature data
- Split the data set
- Compare several modeling algorithms
- Use ensemble strategies to improve the results
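As a rough illustration of the cleaning steps listed under (C), here is a minimal sketch on a small made-up table. The `toy` frame and its values are hypothetical, and median filling and z-score standardization are just one simple choice for each step, not necessarily what this notebook does later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real training set.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, 8.05, 53.10, 8.46, 51.86],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Fill missing values; the median is one simple option among several.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Standardize a numeric feature to zero mean and unit (sample) variance.
toy["Fare_std"] = (toy["Fare"] - toy["Fare"].mean()) / toy["Fare"].std()

# Inspect pairwise correlations between the features and the label.
corr = toy[["Age", "Fare", "Survived"]].corr()
print(corr.round(2))
```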
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
data=pd.read_csv('train.csv')
data.head()
Check for missing values
data.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
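To put these counts in proportion, the sketch below converts them to percentages of the training set. The counts are copied from the output above and 891 is the number of rows in the training set; on the DataFrame itself, `data.isnull().mean()` should give the same fractions directly.

```python
# Missing-value counts copied from the output above.
n_rows = 891  # number of rows in the training set
missing = {"Age": 177, "Cabin": 687, "Embarked": 2}

# Percentage of rows in which each column is missing.
missing_pct = {col: round(100 * n / n_rows, 1) for col, n in missing.items()}
print(missing_pct)
```

Cabin is missing for most passengers, Age for roughly one in five, and Embarked for only two rows, which suggests these three columns may need different handling.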
data.describe()
View the overall survival ratio
f,ax=plt.subplots(1,2,figsize=(18,8))
data['Survived'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True,colors=sns.color_palette(palette='cool'))
ax[0].set_title('Survived')
ax[0].set_ylabel('')
sns.countplot(x='Survived',data=data,ax=ax[1])
ax[1].set_title('Survived')
plt.show()
Of the 891 passengers in the training set, only 342 survived, or about 38.4%. We need to dig further into the data to see which categories of passengers survived and which did not.
We will examine survival using the different features in the data set, such as sex, age, and port of embarkation.
The features fall into two groups: discrete values and continuous values.
- Discrete values: sex (male, female), port of embarkation (S, Q, C)
- Continuous values: age, fare
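One rough way to separate the two groups programmatically is by column dtype. The sketch below uses a hypothetical `toy` frame with made-up rows, and the rule "non-numeric means discrete, float means continuous" is only a heuristic assumed for illustration; integer-coded categories such as Pclass would need separate handling.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_numeric_dtype

# Hypothetical toy rows using the same column names as the training set.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Heuristic: non-numeric columns are discrete, float columns are continuous.
discrete = [c for c in toy.columns if not is_numeric_dtype(toy[c])]
continuous = [c for c in toy.columns if is_float_dtype(toy[c])]

print("discrete:", discrete)
print("continuous:", continuous)
```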
data.groupby(['Sex','Survived'])['Survived'].count()
Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64
f,ax=plt.subplots(1,2,figsize=(18,8))
data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0],color='c')
ax[0].set_title('Survived vs Sex')
sns.countplot(x='Sex',hue='Survived',data=data,ax=ax[1])
ax[1].set_title('Sex:Survived vs Dead')
plt.show()
There were far more men than women on board. However, more than twice as many women as men survived (233 vs. 109). The survival rate for women was about 74%, while for men it was about 19%.
- Pclass -> relationship between cabin class and survival
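The survival rates can be checked by hand from the groupby counts shown earlier. This is a small worked example in plain Python; the `counts` dictionary is copied from that output and `survival_rate` is a helper introduced here for illustration.

```python
# Counts copied from the groupby output: (sex, survived) -> passengers.
counts = {
    ("female", 0): 81, ("female", 1): 233,
    ("male", 0): 468, ("male", 1): 109,
}

def survival_rate(sex):
    """Share of passengers of the given sex who survived."""
    survived = counts[(sex, 1)]
    total = counts[(sex, 0)] + survived
    return survived / total

total_passengers = sum(counts.values())
total_survived = counts[("female", 1)] + counts[("male", 1)]
overall_rate = total_survived / total_passengers

print(f"female:  {survival_rate('female'):.1%}")
print(f"male:    {survival_rate('male'):.1%}")
print(f"overall: {overall_rate:.1%}")
```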
pd.crosstab(data.Pclass,data.Survived,margins=True).style.background_gradient(cmap='spring')
f,ax=plt.subplots(1,2,figsize=(18,8))
data['Pclass'].value_counts().sort_index().plot.bar(color=sns.color_palette('hls',3),ax=ax[0])
ax[0].set_title('Number Of Passengers By Pclass')
ax[0].set_ylabel('Count')
sns.countplot(x='Pclass',hue='Survived',data=data,ax=ax[1])
ax[1].set_title('Pclass:Survived vs Dead')
plt.show()
data['Pclass'].value_counts()
3 491
1 216
2 184
Name: Pclass, dtype: int64
Passengers in Pclass 1 were clearly given priority during the rescue. Although Pclass 3 had far more passengers, its survival rate is very low, about 25%.
For Pclass 1 the survival rate is around 63%, while for Pclass 2 it is about 48%.
- Combined effect of cabin class and sex on survival
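Per-class rates like these can be read off a row-normalized crosstab instead of being computed by hand. The sketch below uses a hypothetical `toy` frame with made-up rows, so its numbers are not the real Titanic rates; it only shows the `normalize="index"` option of `pd.crosstab`.

```python
import pandas as pd

# Hypothetical toy rows, not the real training data.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# normalize="index" divides each row by its row total, so each row sums to 1
# and the column for Survived == 1 is the survival rate of that class.
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(rates.round(2))
```

Applied to the real data, the same call would be `pd.crosstab(data.Pclass, data.Survived, normalize='index')`.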
pd.crosstab([data.Sex,data.Survived],data.Pclass,margins=True).style.background_gradient(cmap='summer_r')
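# factorplot was renamed to catplot in newer seaborn; kind='point' matches the old default
sns.catplot(x='Pclass',y='Survived',hue='Sex',data=data,kind='point')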
plt.show()
This plot is more intuitive to read than the table.
We can easily infer that the survival rate for women in Pclass 1 is about 97%, since only 3 of the 94 women in Pclass 1 did not survive.
It is evident that women were given priority regardless of Pclass.
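The rates behind the plot can also be tabulated directly, since the mean of a 0/1 column is the share of ones. A minimal sketch on a hypothetical `toy` frame (made-up rows, so the numbers are illustrative only):

```python
import pandas as pd

# Hypothetical toy rows, not the real training data.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male"] * 2,
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of Survived per (Pclass, Sex) group, pivoted so Sex becomes columns.
rate_table = toy.groupby(["Pclass", "Sex"])["Survived"].mean().unstack("Sex")
print(rate_table)
```

On the real data, replacing `toy` with `data` should give one survival rate per class and sex, the same quantities the point plot draws.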
It seems Pclass is also an important feature. Let's examine the other features.