Feature engineering and data cleaning
When we get a data set with features, not all of the features are necessarily important. There may be many redundant features that should be eliminated. We can also obtain or add new features through observation, or by extracting information from other features.
Age_band:
As I mentioned earlier, Age is a continuous feature, and continuous variables pose a problem for machine learning models.
If I ask you to group or arrange sports people by gender, you can easily separate them into men and women.
If I ask you to group them by age, how would you do it? If there are 30 people, there may be 30 different age values.
We need to discretize the continuous values into groups (bins).
The maximum age of a passenger is 80. So we divide the range 0-80 into 5 bins: 80/5 = 16, so each bin is 16 years wide.
data['Age_band']=0
data.loc[data['Age']<=16,'Age_band']=0
data.loc[(data['Age']>16)&(data['Age']<=32),'Age_band']=1
data.loc[(data['Age']>32)&(data['Age']<=48),'Age_band']=2
data.loc[(data['Age']>48)&(data['Age']<=64),'Age_band']=3
data.loc[data['Age']>64,'Age_band']=4
data.head(2)
data['Age_band'].value_counts().to_frame().style.background_gradient(cmap='summer')#checking the number of passengers in each band
sns.catplot(x='Age_band',y='Survived',data=data,col='Pclass',kind='point')#factorplot was renamed to catplot in seaborn 0.9
plt.show()
The survival rate decreases as age increases, regardless of Pclass.
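The same banding can also be written with pandas.cut instead of five separate .loc assignments. The following is only a sketch on a handful of made-up ages (not the Titanic data); the names ages and edges are hypothetical. Note one difference: the upper edge is fixed at 80 here, so an age above 80 would become NaN, whereas the comparisons above put everything over 64 into band 4.

```python
import pandas as pd

# Made-up ages chosen to sit on and around the band edges.
ages = pd.DataFrame({'Age': [0.42, 16, 17, 32, 48.5, 64, 80]})

# Five equal-width bands of 16 years. Each interval is closed on the right,
# so 16 falls in band 0 and 64 in band 3, like the <= comparisons above.
edges = [0, 16, 32, 48, 64, 80]
ages['Age_band'] = pd.cut(ages['Age'], bins=edges, labels=False, include_lowest=True)

print(ages)
```

With labels=False, pandas.cut returns the integer index of the bin rather than an interval label, which is what the Age_band column holds.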
Family_Size: the total number of family members
Looking at parents/children (Parch) and siblings/spouses (SibSp) separately is not very direct, so we look at the total number of family members instead.
data['Family_Size']=0
data['Family_Size']=data['Parch']+data['SibSp']#family size
data['Alone']=0
data.loc[data.Family_Size==0,'Alone']=1#Alone
f,ax=plt.subplots(1,2,figsize=(18,6))
sns.pointplot(x='Family_Size',y='Survived',data=data,ax=ax[0])#axes-level point plot, draws directly on ax[0]
ax[0].set_title('Family_Size vs Survived')
sns.pointplot(x='Alone',y='Survived',data=data,ax=ax[1])
ax[1].set_title('Alone vs Survived')
plt.show()
Family_Size = 0 means the passenger is alone. Clearly, if you are alone (Family_Size = 0), the chance of survival is very low. For a family size of 4 or more, the chance decreases as well. This also looks like an important feature for the model. Let us examine it further.
sns.catplot(x='Alone',y='Survived',data=data,hue='Sex',col='Pclass',kind='point')
plt.show()
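To read the numbers behind plots like these, the survival rate per group can be computed directly with groupby. This is a minimal sketch on a tiny made-up frame (the values are invented for illustration and say nothing about the real data); toy and rates are hypothetical names.

```python
import pandas as pd

# Invented rows, only to show the mechanics.
toy = pd.DataFrame({
    'SibSp':    [0, 1, 3, 0, 0],
    'Parch':    [0, 0, 2, 2, 0],
    'Survived': [0, 1, 0, 1, 1],
})

# Same definitions as above: family size is Parch + SibSp,
# and a passenger is alone when that sum is zero.
toy['Family_Size'] = toy['Parch'] + toy['SibSp']
toy['Alone'] = (toy['Family_Size'] == 0).astype(int)

# The mean of a 0/1 column is the fraction of survivors in each group.
rates = toy.groupby('Alone')['Survived'].mean()
print(rates)
```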
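Fare_Range
Because Fare is also a continuous feature, we need to convert it into an ordinal value.
For this we use pandas.qcut, which splits the values into the requested number of bins so that each bin holds roughly the same number of passengers.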
data['Fare_Range']=pd.qcut(data['Fare'],4)
data.groupby(['Fare_Range'])['Survived'].mean().to_frame().style.background_gradient(cmap='summer_r')
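As shown above, we can clearly see that as the fare increases, the chance of survival increases. We cannot pass the Fare_Range intervals to a model as they are, so we convert them to single values, as we did for Age_band.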
data['Fare_cat']=0
data.loc[data['Fare']<=7.91,'Fare_cat']=0
data.loc[(data['Fare']>7.91)&(data['Fare']<=14.454),'Fare_cat']=1
data.loc[(data['Fare']>14.454)&(data['Fare']<=31),'Fare_cat']=2
data.loc[(data['Fare']>31)&(data['Fare']<=513),'Fare_cat']=3
sns.catplot(x='Fare_cat',y='Survived',data=data,hue='Sex',kind='point')
plt.show()
Clearly, as Fare_cat increases, the chance of survival increases. Together with Sex, this feature may become an important feature during modeling.
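The thresholds used for Fare_cat (7.91, 14.454, 31) presumably come from the interval edges that pandas.qcut reports. As a rough sketch of how such edges arise, here is qcut on eight made-up fares (not the Titanic values); fares, edges and fare_cat are hypothetical names.

```python
import pandas as pd

# Made-up fares, already sorted so the bins are easy to follow.
fares = pd.Series([5.0, 7.5, 8.0, 10.0, 14.0, 20.0, 30.0, 100.0])

# retbins=True also returns the quantile edges that define the four bins.
fare_range, edges = pd.qcut(fares, 4, retbins=True)

# labels=False returns the bin index (0..3) directly, which skips the
# manual .loc assignments when the intervals themselves are not needed.
fare_cat = pd.qcut(fares, 4, labels=False)

print(edges)
print(fare_cat.tolist())
```

Because the bins are quantile-based, each one receives the same number of values here, even though the last bin spans a much wider fare range than the first.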
Converting string values to numeric
We do this because we cannot pass strings to a machine learning model.
data['Sex']=data['Sex'].replace(['male','female'],[0,1])#assign the result back instead of replacing inplace on a column
data['Embarked']=data['Embarked'].replace(['S','C','Q'],[0,1,2])
data['Initial']=data['Initial'].replace(['Mr','Mrs','Miss','Master','Other'],[0,1,2,3,4])
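An alternative to the replace calls above is Series.map with an explicit dictionary, which keeps each label next to its code. This is only a sketch on made-up rows; toy, sex_codes and embarked_codes are hypothetical names. One difference to keep in mind: map turns any value missing from the dictionary into NaN, while replace leaves it unchanged.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    'Sex':      ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# Explicit label -> code dictionaries, same codes as in the text.
sex_codes = {'male': 0, 'female': 1}
embarked_codes = {'S': 0, 'C': 1, 'Q': 2}

toy['Sex'] = toy['Sex'].map(sex_codes)
toy['Embarked'] = toy['Embarked'].map(embarked_codes)

# Any label missing from a dictionary would show up here as NaN.
unmapped = toy[['Sex', 'Embarked']].isna().sum()
print(toy)
print(unmapped)
```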
Dropping unneeded features
Name -> We do not need the Name feature, because it cannot be converted into any categorical value
Age -> We have the Age_band feature, so this is not needed
Ticket -> This is an arbitrary string, it cannot be categorized
Fare -> We have the Fare_cat feature, so this is not needed (the same applies to Fare_Range)
Cabin -> This also does not carry much useful meaning
PassengerId -> Cannot be categorized
data.drop(['Name','Age','Ticket','Fare','Cabin','Fare_Range','PassengerId'],axis=1,inplace=True)
sns.heatmap(data.corr(),annot=True,cmap='spring_r',linewidths=0.2,annot_kws={'size':20})
fig=plt.gcf()
fig.set_size_inches(18,15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()
In the correlation plot above, we can see some positively correlated features, such as SibSp and Family_Size, and Parch and Family_Size, and some negatively correlated ones, such as Alone and Family_Size.
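These particular correlations are largely expected by construction: Family_Size is the sum of SibSp and Parch, and Alone is 1 exactly where Family_Size is 0. A small sketch on made-up values (not the Titanic data) shows the same signs; toy and corr are hypothetical names.

```python
import pandas as pd

# Made-up counts of siblings/spouses and parents/children.
toy = pd.DataFrame({
    'SibSp': [0, 1, 3, 0, 1, 0],
    'Parch': [0, 0, 2, 2, 1, 0],
})

# Derived features, defined exactly as in the text.
toy['Family_Size'] = toy['Parch'] + toy['SibSp']
toy['Alone'] = (toy['Family_Size'] == 0).astype(int)

# Pearson correlation between every pair of columns.
corr = toy.corr()
print(corr.round(2))
```

Derived features that correlate this strongly with their inputs carry overlapping information, which is worth remembering when deciding which columns to keep.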