Machine learning on Titanic (3): data cleaning

Feature engineering and data cleaning

When we get a dataset with many features, not all of them are necessarily important. There may be many redundant features that should be eliminated. We can also create new features through observation, or by extracting information from other features.

The Age feature:

As mentioned earlier, Age is a continuous feature, and continuous variables are a problem for machine learning models.

For example, if I ask you to group athletes by sex, you can easily separate them into men and women.

But if I ask you to group them by age, how would you do it? With 30 people, there may be 30 different age values.

We need to convert the continuous values into discrete groups by binning.

The maximum passenger age is 80, so we divide the range 0-80 into 5 bins. 80/5 = 16, so each bin spans 16 years.

data['Age_band']=0
data.loc[data['Age']<=16,'Age_band']=0
data.loc[(data['Age']>16)&(data['Age']<=32),'Age_band']=1
data.loc[(data['Age']>32)&(data['Age']<=48),'Age_band']=2
data.loc[(data['Age']>48)&(data['Age']<=64),'Age_band']=3
data.loc[data['Age']>64,'Age_band']=4
data.head(2)
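The same equal-width binning can also be sketched with pandas.cut. This is only an illustration on a small made-up frame (the `toy` DataFrame and its ages are assumptions, not values from the Titanic data); the bin edges are the ones computed above.

```python
import pandas as pd

# Hypothetical ages, chosen to sit on and around the band edges.
toy = pd.DataFrame({'Age': [0.42, 16, 17, 32, 33, 48, 50, 64, 65, 80]})

# Five equal-width bins over 0-80. Each bin is closed on the right,
# which matches the `<=` comparisons in the manual version.
edges = [0, 16, 32, 48, 64, 80]
toy['Age_band'] = pd.cut(toy['Age'], bins=edges, labels=False, include_lowest=True)

print(toy['Age_band'].tolist())
```

One difference worth knowing: pd.cut yields NaN for a missing or out-of-range age, while the manual version leaves such rows at the initial value 0.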


data['Age_band'].value_counts().to_frame().style.background_gradient(cmap='summer')  # check the number of passengers in each band


sns.catplot(x='Age_band',y='Survived',data=data,col='Pclass',kind='point')  # factorplot was renamed to catplot; kind='point' keeps the old default
plt.show()

The survival rate decreases as age increases, regardless of Pclass.

Family_Size: the total number of family members

Looking at Parch (parents and children) and SibSp (siblings and spouses) separately is not very direct, so we look at the total family size instead.

data['Family_Size']=0
data['Family_Size']=data['Parch']+data['SibSp']#family size
data['Alone']=0
data.loc[data.Family_Size==0,'Alone']=1#Alone
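As a quick sanity check, the two derived columns can be reproduced on a few rows. The sketch below uses a made-up `toy` frame (its values are assumptions for illustration, not Titanic rows) and a slightly more compact way to build Alone.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the Titanic data.
toy = pd.DataFrame({
    'SibSp':    [1, 0, 0, 3, 0, 1],
    'Parch':    [0, 0, 2, 1, 0, 1],
    'Survived': [1, 0, 1, 0, 0, 1],
})

toy['Family_Size'] = toy['Parch'] + toy['SibSp']
# A boolean comparison cast to 0/1 replaces the two-step assignment.
toy['Alone'] = (toy['Family_Size'] == 0).astype(int)

# The mean of a 0/1 column is the survival rate within each group.
rate_by_alone = toy.groupby('Alone')['Survived'].mean()
print(toy)
print(rate_by_alone)
```

The grouped mean is the same quantity that the point plots of Survived display for each category.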

f,ax=plt.subplots(1,2,figsize=(18,6))
# pointplot is axes-level, so it draws on the given subplot and no extra figures need closing
sns.pointplot(x='Family_Size',y='Survived',data=data,ax=ax[0])
ax[0].set_title('Family_Size vs Survived')
sns.pointplot(x='Alone',y='Survived',data=data,ax=ax[1])
ax[1].set_title('Alone vs Survived')
plt.show()

Family_Size = 0 means the passenger is alone. Clearly, if you are alone (Family_Size = 0), the chance of survival is very low. For a family size of 4 or more, the chance also decreases. This looks like an important feature for the model. Let us examine it further.

sns.catplot(x='Alone',y='Survived',data=data,hue='Sex',col='Pclass',kind='point')
plt.show()


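Fare

Since Fare is also a continuous feature, we need to convert it into ordinal values.

For this we use pandas.qcut, which splits the values into the requested number of bins so that each bin holds roughly the same number of rows.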

data['Fare_Range']=pd.qcut(data['Fare'],4)
data.groupby(['Fare_Range'])['Survived'].mean().to_frame().style.background_gradient(cmap='summer_r')


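As we can clearly see, the chance of survival increases as the fare range increases. The Fare_Range intervals cannot be passed to a model as they are, so we convert them to single integer values, as we did for Age_band.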

data['Fare_cat']=0
data.loc[data['Fare']<=7.91,'Fare_cat']=0
data.loc[(data['Fare']>7.91)&(data['Fare']<=14.454),'Fare_cat']=1
data.loc[(data['Fare']>14.454)&(data['Fare']<=31),'Fare_cat']=2
data.loc[(data['Fare']>31)&(data['Fare']<=513),'Fare_cat']=3
sns.catplot(x='Fare_cat',y='Survived',data=data,hue='Sex',kind='point')
plt.show()

Clearly, as Fare_cat increases, the chance of survival increases. Together with Sex, this may become an important feature during modeling.
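The hard-coded fare thresholds could also be taken from qcut itself. The following is a minimal sketch on made-up fares (the `fares` series is an assumption, not the real Titanic column), showing that `labels=False` gives the quartile index directly and `retbins=True` exposes the computed bin edges.

```python
import pandas as pd

# Hypothetical fares, not the real Titanic values.
fares = pd.Series([5, 7, 8, 10, 13, 15, 20, 30, 40, 80, 100, 500], dtype=float)

# labels=False returns the quartile index (0-3) instead of an interval;
# retbins=True also returns the bin edges that qcut computed.
fare_cat, edges = pd.qcut(fares, 4, labels=False, retbins=True)

print(fare_cat.tolist())
print(edges)
```

Deriving the codes this way avoids retyping the edges by hand, at the cost of the edges changing whenever the underlying data changes.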

Converting string values to numeric
We do this because we cannot pass strings to a machine learning model.

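# assign the result back: inplace=True on a selected column may not update data in newer pandas versions
data['Sex']=data['Sex'].replace(['male','female'],[0,1])
data['Embarked']=data['Embarked'].replace(['S','C','Q'],[0,1,2])
data['Initial']=data['Initial'].replace(['Mr','Mrs','Miss','Master','Other'],[0,1,2,3,4])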
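An alternative way to express the same encoding is an explicit dictionary with Series.map. This is a sketch on a made-up `toy` frame, and the dictionary names are hypothetical; the codes themselves are the ones used above.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# Hypothetical helper dictionaries holding the same codes as above.
sex_codes = {'male': 0, 'female': 1}
embarked_codes = {'S': 0, 'C': 1, 'Q': 2}

toy['Sex'] = toy['Sex'].map(sex_codes)
toy['Embarked'] = toy['Embarked'].map(embarked_codes)

# map turns any value missing from the dictionary into NaN,
# so counting NaN is a cheap check for unexpected categories.
unmapped = int(toy[['Sex', 'Embarked']].isna().sum().sum())
print(toy)
print(unmapped)
```

Unlike replace, which leaves unlisted values untouched, map makes unexpected categories visible as NaN.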

Removing unnecessary features

Name -> We do not need the Name feature, because it cannot be converted into any categorical value.

Age -> We have the Age_band feature, so this is not needed.

Ticket -> This is an arbitrary string that cannot be categorized.

Fare -> We have the Fare_cat feature, so this is not needed.

Cabin -> This feature also carries little useful information.

PassengerId -> Cannot be categorized.

data.drop(['Name','Age','Ticket','Fare','Cabin','Fare_Range','PassengerId'],axis=1,inplace=True)
sns.heatmap(data.corr(),annot=True,cmap='spring_r',linewidths=0.2,annot_kws={'size':20})
fig=plt.gcf()
fig.set_size_inches(18,15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()


In the correlation heatmap above, we can see some positively correlated features, such as SibSp and Family_Size, and Parch and Family_Size, and some negatively correlated ones, such as Alone and Family_Size.
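These signs follow from how the features were built: Family_Size is the sum of SibSp and Parch, and Alone is 1 exactly when Family_Size is 0. A small sketch on made-up rows (the `toy` frame is an assumption, not Titanic data) shows the same pattern numerically.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'SibSp': [1, 0, 0, 3, 0, 1],
    'Parch': [0, 0, 2, 1, 0, 1],
})
toy['Family_Size'] = toy['Parch'] + toy['SibSp']
toy['Alone'] = (toy['Family_Size'] == 0).astype(int)

# Pearson correlation between every pair of columns.
corr = toy.corr()

# Correlation of each feature with Family_Size.
print(corr['Family_Size'].round(2))
```

Because such features are derived from one another, their strong correlations reflect the construction rather than new information.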

Origin blog.csdn.net/weixin_44727383/article/details/105053333