How continuous features affect survival
- Age-> A continuous feature: how its range of values relates to the outcome
print('Oldest Passenger was of:',data['Age'].max(),'Years')
print('Youngest Passenger was of:',data['Age'].min(),'Years')
print('Average Age on the ship:',data['Age'].mean(),'Years')
Oldest Passenger was of: 80.0 Years
Youngest Passenger was of: 0.42 Years
Average Age on the ship: 29.69911764705882 Years
f,ax=plt.subplots(1,2,figsize=(16,5))
sns.violinplot(x="Pclass",y="Age", hue="Survived", data=data,split=True,ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0,110,10))
sns.violinplot(x="Sex",y="Age", hue="Survived", data=data,split=True,ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0,110,10))
plt.show()
Results:
1) For children under the age of 10, the survival rate rises with the number of passengers.
2) Passengers aged 20-50 have a higher probability of survival.
3) For men, the chance of survival decreases with age.
Ways to fill missing values
- Use the mean value
- Use a value based on experience
- Predict the value with a regression model
- Drop the affected rows
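As a minimal sketch of the first and last options in the list above, here they are on a small made-up frame. The name `toy` and its values are invented for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({'Age': [22.0, np.nan, 30.0, np.nan, 38.0]})

# First option: fill missing values with the column mean
mean_filled = toy['Age'].fillna(toy['Age'].mean())

# Last option: drop the rows that have a missing value
dropped = toy.dropna(subset=['Age'])

print(mean_filled.tolist())
print(len(dropped))
```

Filling keeps every row but pulls the distribution towards the mean, while dropping keeps the distribution intact but loses rows.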
As we saw earlier, the Age feature has 177 missing values. To replace them, we could assign the mean age of the dataset to each one.
But the problem is that the passengers span many different ages, so a single mean fits few of them well. A better way is to find a suitable age for each passenger!
We can check the Name feature. The names contain titles such as Mr. or Mrs., so we can assign each passenger the mean age of their title group (Mr, Mrs and so on).
# Extract the title (the letters right before a period) from each name
data['Initial']=data.Name.str.extract(r'([A-Za-z]+)\.',expand=False)
data
pd.crosstab(data.Initial,data.Sex).T.style.background_gradient(cmap='summer_r')
# Map rare or misspelled titles onto the main groups
data['Initial']=data['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'])
data.groupby('Initial')['Age'].mean()
Initial
Master 4.574167
Miss 21.860000
Mr 32.739609
Mrs 35.981818
Other 45.888889
Name: Age, dtype: float64
## Fill the missing ages with the mean of each group
data.loc[(data.Age.isnull())&(data.Initial=='Mr'),'Age']=33
data.loc[(data.Age.isnull())&(data.Initial=='Mrs'),'Age']=36
data.loc[(data.Age.isnull())&(data.Initial=='Master'),'Age']=5
data.loc[(data.Age.isnull())&(data.Initial=='Miss'),'Age']=22
data.loc[(data.Age.isnull())&(data.Initial=='Other'),'Age']=46
data.Age.isnull().any()
False
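The same per-group fill can also be written without hard-coding each mean. The sketch below assumes a frame with `Initial` and `Age` columns like the one above; `toy` and its values are made up for illustration, and `groupby(...).transform('mean')` is swapped in for the five separate assignments.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data` (values are made up)
toy = pd.DataFrame({
    'Initial': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Miss'],
    'Age':     [30.0, 36.0, np.nan, 20.0, np.nan, 24.0],
})

# Mean age of each title group, broadcast back onto every row
group_mean = toy.groupby('Initial')['Age'].transform('mean')

# Fill only the missing ages with the rounded mean of their own group
toy['Age'] = toy['Age'].fillna(group_mean.round())

print(toy)
```

This version keeps working if the title groups or their means change, at the cost of being a little less explicit than the hand-written values.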
f,ax=plt.subplots(1,2,figsize=(20,10))
data[data['Survived']==0].Age.plot.hist(ax=ax[0],bins=20,edgecolor='black',color='cyan')
ax[0].set_title('Survived= 0')
x1=list(range(0,85,5))
ax[0].set_xticks(x1)
data[data['Survived']==1].Age.plot.hist(ax=ax[1],color='hotpink',bins=20,edgecolor='black')
ax[1].set_title('Survived= 1')
x2=list(range(0,85,5))
ax[1].set_xticks(x2)
plt.show()
Observed:
1) Many young children (around 5 years old) were rescued (the women and children first policy).
2) The oldest passenger (80 years) was saved.
3) The death toll is highest in the 30-40 age group.
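An observation like the third one can also be checked numerically rather than read off the histogram. Here is a minimal sketch on invented data: `toy`, the `AgeBand` column and the 10-year bin edges are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy data (hypothetical passengers, for illustration only)
toy = pd.DataFrame({
    'Age':      [3, 25, 31, 35, 38, 45, 80],
    'Survived': [1, 0, 0, 0, 1, 0, 1],
})

# Cut ages into 10-year bands: [0, 10), [10, 20), ..., [80, 90)
bins = list(range(0, 100, 10))
toy['AgeBand'] = pd.cut(toy['Age'], bins=bins, right=False)

# Count the passengers who did not survive in each band
deaths = toy[toy['Survived'] == 0].groupby('AgeBand', observed=True).size()

print(deaths)
print('Band with most deaths:', deaths.idxmax())
```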
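# factorplot was renamed to catplot in newer seaborn; kind='point' keeps the same plot type
sns.catplot(x='Pclass',y='Survived',col='Initial',data=data,kind='point')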
plt.show()
- Embarked-> Boarding location
pd.crosstab([data.Embarked,data.Pclass],[data.Sex,data.Survived],margins=True).style.background_gradient(cmap='summer_r')
sns.catplot(x='Embarked',y='Survived',data=data,kind='point')
fig=plt.gcf()
fig.set_size_inches(5,3)
plt.show()
Port C has the highest chance of survival, at about 0.55, and port S has the lowest.
f,ax=plt.subplots(2,2,figsize=(20,15))
sns.countplot(x='Embarked',data=data,ax=ax[0,0])
ax[0,0].set_title('No. Of Passengers Boarded')
sns.countplot(x='Embarked',hue='Sex',data=data,ax=ax[0,1])
ax[0,1].set_title('Male-Female Split for Embarked')
sns.countplot(x='Embarked',hue='Survived',data=data,ax=ax[1,0])
ax[1,0].set_title('Embarked vs Survived')
sns.countplot(x='Embarked',hue='Pclass',data=data,ax=ax[1,1])
ax[1,1].set_title('Embarked vs Pclass')
plt.subplots_adjust(wspace=0.2,hspace=0.5)
plt.show()
Observed:
1) Most passengers are in Pclass 3.
2) Passengers from port C look fortunate: a fair share of them survived.
3) Many of the rich passengers boarded at port S, yet the chance of survival there is still very low.
4) At port Q, almost 95% of the passengers are poor (Pclass 3).
sns.catplot(x='Pclass',y='Survived',hue='Sex',col='Embarked',data=data,kind='point')
plt.show()
Observed:
1) The probability of survival is almost 1 for women in Pclass 1 and Pclass 2.
2) For Pclass 3 passengers, both male and female survival rates are very low.
3) Port Q looks unlucky, because its passengers were mostly in Pclass 3.
The port feature also has missing values. Here I fill them with the mode, because most passengers boarded at S.
data['Embarked']=data['Embarked'].fillna('S')
data.Embarked.isnull().any()
False
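Instead of typing the most common port by hand, the mode can be computed first. A minimal sketch on made-up data follows; `toy` and `most_common` are illustrative names, and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy column standing in for data['Embarked'] (values are made up)
toy = pd.DataFrame({'Embarked': ['S', 'C', 'S', np.nan, 'Q', 'S']})

# mode() ignores missing values and returns a Series; take its first entry
most_common = toy['Embarked'].mode()[0]

# Fill the gaps with the most frequent port
toy['Embarked'] = toy['Embarked'].fillna(most_common)

print(most_common)
print(toy['Embarked'].tolist())
```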
- SibSp-> Number of siblings and spouses on board
This feature shows whether a person is travelling alone or together with family.
pd.crosstab([data.SibSp],data.Survived).style.background_gradient(cmap='summer_r')
f,ax=plt.subplots(1,2,figsize=(20,8))
sns.barplot(x='SibSp',y='Survived',data=data,ax=ax[0])
ax[0].set_title('SibSp vs Survived')
# pointplot draws on the given axes, so no extra figure has to be closed
sns.pointplot(x='SibSp',y='Survived',data=data,ax=ax[1])
ax[1].set_title('SibSp vs Survived')
plt.show()
pd.crosstab(data.SibSp,data.Pclass).style.background_gradient(cmap='summer_r')
Observed:
The bar plot and point plot show that a passenger on board alone, without siblings, had a 34.5% survival rate. As the number of siblings increases, the rate drops substantially. This makes sense: if I had family on board, I would try to save them before saving myself. Surprisingly, however, the survival rate for families with 5-8 members was 0%. The reason may be that they were in Pclass 3?
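A rate like the 34.5% above is just the mean of the 0/1 `Survived` column within each group. A minimal sketch on invented data (the frame `toy` and its values are hypothetical, so the numbers will not match the real dataset):

```python
import pandas as pd

# Toy data (hypothetical passengers, for illustration only)
toy = pd.DataFrame({
    'SibSp':    [0, 0, 0, 0, 1, 1, 5, 5],
    'Survived': [1, 0, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate
rate = toy.groupby('SibSp')['Survived'].mean()

print(rate)
```

The same pattern applies to any other grouping column, such as `Parch` or `Embarked`.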
- Parch-> Number of parents and children on board
pd.crosstab(data.Parch,data.Pclass).style.background_gradient(cmap='summer_r')
f,ax=plt.subplots(1,2,figsize=(20,8))
sns.barplot(x='Parch',y='Survived',data=data,ax=ax[0])
ax[0].set_title('Parch vs Survived')
# pointplot draws on the given axes, so no extra figure has to be closed
sns.pointplot(x='Parch',y='Survived',data=data,ax=ax[1])
ax[1].set_title('Parch vs Survived')
plt.show()
Observed:
The results here are also very similar. Passengers travelling with their parents have a greater chance of survival. However, it decreases as the number grows.
The chance of survival is good for passengers with 1-3 parents on board. Being alone proves fatal, and the chance of survival also decreases when someone has four parents on the ship.
- Fare-> ticket prices
f,ax=plt.subplots(1,3,figsize=(20,8))
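# distplot is deprecated in newer seaborn; histplot with kde=True gives a comparable plot
sns.histplot(data[data['Pclass']==1].Fare,kde=True,stat='density',ax=ax[0])
ax[0].set_title('Fares in Pclass 1')
sns.histplot(data[data['Pclass']==2].Fare,kde=True,stat='density',ax=ax[1])
ax[1].set_title('Fares in Pclass 2')
sns.histplot(data[data['Pclass']==3].Fare,kde=True,stat='density',ax=ax[2])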
ax[2].set_title('Fares in Pclass 3')
plt.show()
A summary of all the features so far:
Sex: Compared with men, women have a high chance of survival.
Pclass: There is a clear trend that first class passengers have a better chance of survival. The survival rate for Pclass 3 is very low. For women in Pclass 1, the chance of survival is almost 1.
Age: Children younger than 5-10 years have a high survival rate. Many passengers aged 15-35 died.
Embarked: Survival chances differ by port of embarkation, and the differences in mortality are large!
Family: Having 1-2 siblings or a spouse on board, or 1-3 parents, gives you a greater probability of survival than travelling alone or with a big family.
- Correlation between the features
Correlation heatmap
The first thing to note is that only numeric features are compared.
Positive correlation: If an increase in feature A leads to an increase in feature B, the two are positively correlated. A value of 1 indicates a perfect positive correlation.
Negative correlation: If an increase in feature A leads to a decrease in feature B, the two are negatively correlated. A value of -1 indicates a perfect negative correlation.
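The two extreme cases can be illustrated with a small worked example. The arrays `a`, `b` and `c` below are made up: `b` grows linearly with `a`, and `c` falls linearly as `a` grows.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = 2 * a + 1      # increases whenever a increases
c = -3 * a + 10    # decreases whenever a increases

# np.corrcoef returns the correlation matrix; [0, 1] is the off-diagonal entry
r_ab = np.corrcoef(a, b)[0, 1]
r_ac = np.corrcoef(a, c)[0, 1]

print(round(r_ab, 6), round(r_ac, 6))
```

Up to floating-point rounding, `r_ab` comes out as 1 and `r_ac` as -1.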
sns.heatmap(data.corr(numeric_only=True),annot=True,cmap='rainbow',linewidths=0.2) #data.corr()-->correlation matrix of the numeric columns
fig=plt.gcf()
fig.set_size_inches(10,8)
plt.show()
Now suppose two features are highly or perfectly correlated, so that an increase in one leads to an increase in the other. This means the two features carry very similar information, with little or no variation between them. One of them is then of no extra value to us!
So do you think we should use both of them? When building or training a model, we should try to reduce redundant features, because this cuts training time and has many other advantages.
Now, from the heatmap above, we can see that the features are not strongly correlated.
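Rather than reading the heatmap by eye, the redundant pairs can also be listed programmatically. The sketch below uses an invented frame `toy` and an arbitrary cut-off of 0.9; both are assumptions for illustration, not values from the analysis above.

```python
import numpy as np
import pandas as pd

# Toy features (made-up values): b is exactly 2 * a, c is unrelated-looking
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': [2, 4, 6, 8, 10, 12],
    'c': [3, 1, 4, 1, 5, 9],
})

# Absolute correlations, keeping only the upper triangle so that
# each pair appears once and the diagonal is ignored
corr = toy.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

threshold = 0.9  # arbitrary cut-off chosen for this sketch
pairs = [(r, c) for r in upper.index for c in upper.columns
         if upper.loc[r, c] > threshold]

print(pairs)
```

For each pair reported this way, one of the two features is a candidate for removal before training.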