[Machine Learning] Titanic (2): Data Mining

How continuous variables affect survival

  • Age -> effect of a continuous feature on the outcome
print('Oldest Passenger was of:',data['Age'].max(),'Years')
print('Youngest Passenger was of:',data['Age'].min(),'Years')
print('Average Age on the ship:',data['Age'].mean(),'Years')

Oldest Passenger was of: 80.0 Years
Youngest Passenger was of: 0.42 Years
Average Age on the ship: 29.69911764705882 Years

f,ax=plt.subplots(1,2,figsize=(16,5))
# Pass x and y by keyword; newer seaborn versions do not accept them positionally
sns.violinplot(x="Pclass",y="Age",hue="Survived",data=data,split=True,ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0,110,10))
sns.violinplot(x="Sex",y="Age",hue="Survived",data=data,split=True,ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0,110,10))
plt.show()

Results:
1) The number of children increases with Pclass, and children under 10 have a good survival rate.

2) Passengers aged 20-50 have a comparatively high probability of survival.

3) For men, the chance of survival decreases with age.

Filling missing values

  • Fill with the mean value
  • Fill based on experience (domain knowledge)
  • Predict with a regression model
  • Drop the rows with missing values

As we saw earlier, the Age feature has 177 missing values. To replace them, we could assign the mean age of the dataset.

But the problem is that passengers span a wide range of ages, so a single mean fits many of them poorly. A better way is to find a suitable age for each passenger!

We can look at the Name feature. Names contain titles such as Mr. or Mrs., so we can assign each passenger the mean age of the corresponding title group.

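# Extract the title (the letters right before a period) from each name.
# One vectorized call covers every row, so no loop is needed.
data['Initial']=data.Name.str.extract(r'([A-Za-z]+)\.',expand=False)
data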
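
To see what the regular expression captures, here is a minimal sketch on a handful of names written in the dataset's "Surname, Title. Given names" format. The `names` Series is a stand-in for `data.Name`, not the real column.

```python
import pandas as pd

# Stand-in for data.Name: a few names in the dataset's format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# ([A-Za-z]+)\. captures the run of letters directly before the first period.
# expand=False returns a Series instead of a one-column DataFrame.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

The surname is skipped because it is followed by a comma, not a period, so the first match is always the title.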

pd.crosstab(data.Initial,data.Sex).T.style.background_gradient(cmap='summer_r')

# Assign the result back instead of calling replace(..., inplace=True) on a column selection
data['Initial']=data['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'])
data.groupby('Initial')['Age'].mean()
Initial
Master     4.574167
Miss      21.860000
Mr        32.739609
Mrs       35.981818
Other     45.888889
Name: Age, dtype: float64
## Fill missing ages with the rounded mean age of each group
data.loc[(data.Age.isnull())&(data.Initial=='Mr'),'Age']=33
data.loc[(data.Age.isnull())&(data.Initial=='Mrs'),'Age']=36
data.loc[(data.Age.isnull())&(data.Initial=='Master'),'Age']=5
data.loc[(data.Age.isnull())&(data.Initial=='Miss'),'Age']=22
data.loc[(data.Age.isnull())&(data.Initial=='Other'),'Age']=46
data.Age.isnull().any()

False
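
The five assignments above can also be written as one step. The sketch below is an alternative, not the original method: it fills each missing age with the rounded mean of the passenger's title group using `groupby(...).transform('mean')`. The `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: two title groups, each with one missing age.
toy = pd.DataFrame({
    "Initial": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "Age": [30.0, 36.0, np.nan, 20.0, 24.0, np.nan],
})

# Mean age of each title group, broadcast back to every row of that group.
group_mean = toy.groupby("Initial")["Age"].transform("mean")

# Fill only the missing ages; known ages are left untouched.
toy["Age"] = toy["Age"].fillna(group_mean.round())
print(toy)
```

This avoids hard-coding the group means, so the fill values stay correct if the title grouping changes.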
f,ax=plt.subplots(1,2,figsize=(20,10))
data[data['Survived']==0].Age.plot.hist(ax=ax[0],bins=20,edgecolor='black',color='cyan')
ax[0].set_title('Survived= 0')
x1=list(range(0,85,5))
ax[0].set_xticks(x1)
data[data['Survived']==1].Age.plot.hist(ax=ax[1],color='hotpink',bins=20,edgecolor='black')
ax[1].set_title('Survived= 1')
x2=list(range(0,85,5))
ax[1].set_xticks(x2)
plt.show()

Observed:

1) Many young children (around 5 and under) were rescued, in line with the women-and-children-first policy.

2) The oldest passenger (80 years) was saved.

3) The death toll is highest in the 30-40 age group.
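
The age-band observation can be checked numerically as well as visually. A rough sketch, assuming 10-year bands (the histograms above use 20 bins, so the bands here are coarser) and a made-up `toy` frame:

```python
import pandas as pd

# Made-up ages and outcomes, for illustration only.
toy = pd.DataFrame({
    "Age":      [2, 4, 25, 31, 34, 38, 39, 45, 80],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 1, 1],
})

# Lower edge of each 10-year band: 31 -> 30, 39 -> 30, 45 -> 40.
toy["AgeBand"] = (toy["Age"] // 10) * 10

# Count the passengers who died (Survived == 0) in each band.
deaths_per_band = toy[toy["Survived"] == 0].groupby("AgeBand").size()
print(deaths_per_band)
print("Band with most deaths:", deaths_per_band.idxmax())
```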

# factorplot was renamed to catplot; kind='point' keeps the original point plot
sns.catplot(x='Pclass',y='Survived',col='Initial',data=data,kind='point')
plt.show()

  • Embarked -> port of embarkation
pd.crosstab([data.Embarked,data.Pclass],[data.Sex,data.Survived],margins=True).style.background_gradient(cmap='summer_r')

# factorplot was renamed to catplot; kind='point' keeps the original point plot
sns.catplot(x='Embarked',y='Survived',data=data,kind='point')
fig=plt.gcf()
fig.set_size_inches(5,3)
plt.show()

Port C has the highest survival rate, about 0.55, and port S has the lowest.
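
The rates read off the plot can also be computed directly. Since Survived is 0 or 1, its mean within a group is that group's survival rate. A small sketch on made-up data (the `toy` frame is hypothetical and does not reproduce the 0.55 figure):

```python
import pandas as pd

# Made-up passengers: port of embarkation and outcome.
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "C", "S", "S", "S", "S", "Q", "Q"],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
rate = toy.groupby("Embarked")["Survived"].mean()
print(rate.sort_values(ascending=False))
```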

f,ax=plt.subplots(2,2,figsize=(20,15))
# Pass x by keyword; newer seaborn versions do not accept it positionally
sns.countplot(x='Embarked',data=data,ax=ax[0,0])
ax[0,0].set_title('No. Of Passengers Boarded')
sns.countplot(x='Embarked',hue='Sex',data=data,ax=ax[0,1])
ax[0,1].set_title('Male-Female Split for Embarked')
sns.countplot(x='Embarked',hue='Survived',data=data,ax=ax[1,0])
ax[1,0].set_title('Embarked vs Survived')
sns.countplot(x='Embarked',hue='Pclass',data=data,ax=ax[1,1])
ax[1,1].set_title('Embarked vs Pclass')
plt.subplots_adjust(wspace=0.2,hspace=0.5)
plt.show()

Observed:

1) Most passengers are in Pclass 3.

2) Passengers from port C look fortunate: a good share of them survived.

3) Many wealthy passengers boarded at port S, yet their chance of survival is still low.

4) At port Q, almost 95% of the passengers are poor.

# factorplot was renamed to catplot; kind='point' keeps the original point plot
sns.catplot(x='Pclass',y='Survived',hue='Sex',col='Embarked',data=data,kind='point')
plt.show()

Observed:

1) The probability of survival is almost 1 for women in Pclass 1 and Pclass 2.

2) For Pclass 3 passengers, survival rates are very low for both men and women.

3) Port Q looks unfortunate, because its passengers are mostly in Pclass 3.

Embarked also has missing values. Here I fill them with the mode, because most passengers boarded at port S.

data['Embarked']=data['Embarked'].fillna('S')
data.Embarked.isnull().any()

False
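
Instead of hard-coding 'S', the mode can be looked up from the column itself. A minimal sketch on a made-up `toy` column:

```python
import numpy as np
import pandas as pd

# Made-up column with two missing ports.
toy = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q", "S", np.nan]})

# mode() ignores NaN by default and returns a Series; take its first entry.
most_common = toy["Embarked"].mode()[0]

# Assign the filled column back rather than modifying it in place.
toy["Embarked"] = toy["Embarked"].fillna(most_common)
print(most_common, toy["Embarked"].isnull().any())
```
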
  • SibSp -> number of siblings and spouses

This feature indicates whether a person is travelling alone or with family members.

pd.crosstab([data.SibSp],data.Survived).style.background_gradient(cmap='summer_r')

f,ax=plt.subplots(1,2,figsize=(20,8))
sns.barplot(x='SibSp',y='Survived',data=data,ax=ax[0])
ax[0].set_title('SibSp vs Survived')
# pointplot is the axes-level point plot, so it draws into ax[1] directly
sns.pointplot(x='SibSp',y='Survived',data=data,ax=ax[1])
ax[1].set_title('SibSp vs Survived')
plt.show()

pd.crosstab(data.SibSp,data.Pclass).style.background_gradient(cmap='summer_r')

Observed:

Both plots show that a passenger travelling alone, with no siblings or spouse on board, has a 34.5% survival rate. As the number of siblings increases, the rate generally falls. This makes sense: if I had family on board, I would try to save them rather than save myself first. Surprisingly, however, the survival rate for families of 5-8 members is 0%. The reason may be that they were in Pclass 3?
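
One way to probe the Pclass question is to list which classes the large sibling groups travelled in. The sketch below uses a made-up `toy` frame, so its numbers are illustrative only and say nothing about the real data:

```python
import pandas as pd

# Made-up passengers, for illustration only.
toy = pd.DataFrame({
    "SibSp":    [0, 0, 0, 1, 1, 2, 5, 5, 8],
    "Pclass":   [1, 3, 2, 1, 2, 3, 3, 3, 3],
    "Survived": [1, 0, 0, 1, 1, 0, 0, 0, 0],
})

# Survival rate and group size for each SibSp value.
summary = toy.groupby("SibSp")["Survived"].agg(["mean", "count"])
print(summary)

# Which passenger classes appear among passengers with 5 or more siblings/spouses?
large = toy.loc[toy["SibSp"] >= 5, "Pclass"].unique()
print(large)
```

Showing the group size next to the rate matters here, because a 0% rate computed from a handful of passengers is weak evidence on its own.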

  • Parch -> number of parents and children
pd.crosstab(data.Parch,data.Pclass).style.background_gradient(cmap='summer_r')

f,ax=plt.subplots(1,2,figsize=(20,8))
sns.barplot(x='Parch',y='Survived',data=data,ax=ax[0])
ax[0].set_title('Parch vs Survived')
# pointplot is the axes-level point plot, so it draws into ax[1] directly
sns.pointplot(x='Parch',y='Survived',data=data,ax=ax[1])
ax[1].set_title('Parch vs Survived')
plt.show()

Observed:

The results here are quite similar. Passengers travelling with parents have a greater chance of survival, but it decreases as the number grows.

The chance of survival is good for passengers with 1-3 parents or children on board. Being alone proves fatal, and the chance of survival also drops for passengers with four or more.
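
The same comparison can be tabulated rather than plotted. A sketch using `pd.crosstab` with `normalize='index'`, which turns each row into proportions; the `toy` data is made up:

```python
import pandas as pd

# Made-up passengers, for illustration only.
toy = pd.DataFrame({
    "Parch":    [0, 0, 0, 0, 1, 1, 2, 2, 4],
    "Survived": [0, 0, 1, 0, 1, 1, 1, 0, 0],
})

# Each row sums to 1: column 0 is the share that died, column 1 the share that survived.
rates = pd.crosstab(toy["Parch"], toy["Survived"], normalize="index")
print(rates)
```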

  • Fare -> ticket price
f,ax=plt.subplots(1,3,figsize=(20,8))
# distplot is deprecated; histplot with kde=True draws a comparable histogram and density curve
sns.histplot(data[data['Pclass']==1].Fare,kde=True,stat='density',ax=ax[0])
ax[0].set_title('Fares in Pclass 1')
sns.histplot(data[data['Pclass']==2].Fare,kde=True,stat='density',ax=ax[1])
ax[1].set_title('Fares in Pclass 2')
sns.histplot(data[data['Pclass']==3].Fare,kde=True,stat='density',ax=ax[2])
ax[2].set_title('Fares in Pclass 3')
plt.show()

Summary of all features:
Sex: Women have a higher chance of survival than men.

Pclass: There is a clear trend that being a first-class passenger gives a better chance of survival. The survival rate for Pclass 3 is very low. For women in Pclass 1, the chance of survival is almost 1.

Age: Children younger than 5-10 years have a high survival rate. Many passengers aged between 15 and 35 died.

Embarked: Survival differs by port of embarkation, and the differences are large!

Family: Having 1-2 siblings or a spouse on board, or 1-3 parents or children, gives a greater probability of survival than travelling alone or with a large family.

  • Correlation between the features

Correlation heatmap

The first thing to note is that only numeric features are compared.

Positive correlation: if an increase in feature A goes with an increase in feature B, the two are positively correlated. A value of 1 indicates perfect positive correlation.

Negative correlation: if an increase in feature A goes with a decrease in feature B, the two are negatively correlated. A value of -1 indicates perfect negative correlation.
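
As a worked example of the two extremes, the sketch below computes the Pearson coefficient (the default method of pandas `corr()`) by hand for made-up arrays: `b` rises linearly with `a`, and `c` falls linearly.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = 2 * a + 1      # rises whenever a rises
c = -3 * a + 10    # falls whenever a rises

def pearson(x, y):
    """Pearson r: covariance divided by the product of the standard deviations."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

print(pearson(a, b))  # close to +1: perfect positive correlation
print(pearson(a, c))  # close to -1: perfect negative correlation
```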

# numeric_only=True skips text columns, which newer pandas versions no longer drop silently
sns.heatmap(data.corr(numeric_only=True),annot=True,cmap='rainbow',linewidths=0.2) #data.corr()-->correlation matrix
fig=plt.gcf()
fig.set_size_inches(10,8)
plt.show()

Now suppose two features are highly or perfectly correlated, so an increase in one goes with an increase in the other. This means both features carry very similar information, with little or no variation between them. One of the two is then redundant and adds no value for us.

So should we use both of them? When building or training a model, we should try to remove redundant features, because this reduces training time, among other advantages.

From the heatmap above, we can see that the features are not strongly correlated.
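
A possible way to spot redundant pairs programmatically is to scan the upper triangle of the absolute correlation matrix. This is a sketch on made-up data; the 0.9 `threshold` is an arbitrary assumption, not a value from this analysis.

```python
import numpy as np
import pandas as pd

# Made-up features: b is nearly a multiple of a, c is unrelated to both.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [5, 1, 4, 2, 6, 3],
})

corr = toy.corr().abs()

# Keep only the entries above the diagonal so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

threshold = 0.9  # arbitrary cutoff, chosen for illustration
redundant = pairs[pairs > threshold]
print(redundant)
```

Only the (a, b) pair exceeds the cutoff here, so one of those two columns would be a candidate for removal.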

Origin blog.csdn.net/weixin_44727383/article/details/105052927