Titanic Survival Prediction (Part 2): Feature Analysis

5. Re-analyzing the Features

Re-analyze the processed data:

train[['Survived','Pclass','Sex','Age_level','Fare_log','Embarked','Familysize','isAlone','Has_Cabin','Title']].groupby('Survived',as_index=False).mean()
   Survived    Pclass       Sex  Age_level  Fare_log  Embarked  Familysize   isAlone  Has_Cabin     Title
0         0  2.531876  0.852459   2.653916  1.652095  1.307832    1.883424  0.681239   0.123862  4.500911
1         1  1.950292  0.318713   2.587719  2.198830  1.447368    1.938596  0.476608   0.397661  2.786550

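From the table above:

1) Pclass: survivors (Survived=1) have a lower mean Pclass, so 1st and 2nd class passengers survived at a relatively higher rate than 3rd class passengers;

2) Sex: women (Sex=0) survived at a higher rate than men (Sex=1);

3) Age_level: survivors have a slightly lower mean Age_level (2.59 vs 2.65), so younger passengers survived at a somewhat higher rate;

4) Fare_log: passengers who paid higher fares survived at a higher rate;

5) isAlone: passengers who boarded alone (isAlone=1) survived at a lower rate than those accompanied by family (isAlone=0);

6) Has_Cabin: passengers with a cabin (Has_Cabin=1) survived at a higher rate.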
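
The table above gives the mean of each feature within each Survived group. A complementary view is the survival rate within each value of a feature, which reads more directly as "who survived more often". The sketch below shows both views on a small made-up DataFrame; only the column names Survived, Sex and Has_Cabin come from this post, the rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the post.
toy = pd.DataFrame({
    'Survived':  [1, 1, 0, 0, 1, 0, 0, 1],
    'Sex':       [0, 0, 1, 1, 0, 1, 0, 1],   # 0 = female, 1 = male
    'Has_Cabin': [1, 0, 0, 0, 1, 0, 0, 1],
})

# View used in the post: mean of each feature per outcome.
feature_means = toy.groupby('Survived').mean()

# Complementary view: survival rate per feature value.
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
rate_by_cabin = toy.groupby('Has_Cabin')['Survived'].mean()

print(feature_means)
print(rate_by_sex)
print(rate_by_cabin)
```

On the real data, replacing toy with train should give the per-value survival rates behind conclusions 1) to 6).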

6. Correlation Analysis: Multivariate Analysis

a. Bar charts and pivot tables

f, (axis1,axis2) = plt.subplots(1,2,figsize=(18,6))
sns.barplot(x='Embarked',y='Survived',hue='Sex',data=train,ax=axis1)
sns.barplot(x='Age_level',y='Survived',hue='Sex',data=train,ax=axis2)

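Findings:

1) Sex has a very clear effect on survival: women (Sex=0) generally survived at a higher rate than men;

2) Women aged 50+ and children under 14 (Age_level=1) survived at a higher rate;

3) Unlike men in the other age groups, who have low survival rates, boys under 14 survived at a relatively high rate;

4) This motivates a new feature for young boys: "boy".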
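
The code that builds the "boy" feature is not shown in this post. A minimal sketch, assuming it simply flags male passengers in the youngest age band (Sex=1 and Age_level=1, following the encoding used above), could look like this; the toy rows are hypothetical and the exact definition used by the author may differ.

```python
import pandas as pd

# Hypothetical toy data using the post's encoding:
# Sex 1 = male, Age_level 1 = under 14.
toy = pd.DataFrame({
    'Sex':       [1, 1, 0, 1],
    'Age_level': [1, 3, 1, 2],
})

# Assumed definition: a "boy" is a male passenger in the youngest age band.
toy['boy'] = ((toy['Sex'] == 1) & (toy['Age_level'] == 1)).astype(int)

print(toy)
```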

Two-factor combined analysis:

for dataset in full_data:
    dataset['Embarked_gender']=0
    dataset.loc[(dataset['Embarked']==1)&(dataset['Sex']==0),'Embarked_gender']=1
    dataset.loc[(dataset['Embarked']==2)&(dataset['Sex']==0),'Embarked_gender']=2
    dataset.loc[(dataset['Embarked']==3)&(dataset['Sex']==0),'Embarked_gender']=3
    dataset.loc[(dataset['Embarked']==1)&(dataset['Sex']==1),'Embarked_gender']=4
    dataset.loc[(dataset['Embarked']==2)&(dataset['Sex']==1),'Embarked_gender']=5
    dataset.loc[(dataset['Embarked']==3)&(dataset['Sex']==1),'Embarked_gender']=6

train[['Embarked_gender','Survived']].groupby('Embarked_gender',as_index=False).mean().sort_values(by='Survived',ascending=False)
   Embarked_gender  Survived
1                2  0.876712
2                3  0.750000
0                1  0.692683
4                5  0.305263
3                4  0.174603
5                6  0.073171
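
The six .loc assignments above can also be written as one arithmetic expression. This is only a sketch and it assumes that Embarked takes only the values 1, 2, 3 and Sex only 0 or 1, with no missing values; the loop version leaves 0 for any other combination, whereas this one does not.

```python
import pandas as pd

# Hypothetical toy data covering every Embarked/Sex combination.
toy = pd.DataFrame({
    'Embarked': [1, 2, 3, 1, 2, 3],
    'Sex':      [0, 0, 0, 1, 1, 1],
})

# Females (Sex=0) map to 1..3 and males (Sex=1) to 4..6,
# the same codes produced by the .loc assignments above.
toy['Embarked_gender'] = toy['Embarked'] + 3 * toy['Sex']

print(toy)
```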

Pivot table:

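# Mean survival rate by (Familysize, Pclass) and Sex; margins=True adds "All" totals.
train_pivot = pd.pivot_table(train, index=['Familysize','Pclass'], columns='Sex', values='Survived', aggfunc='mean', margins=True)
def get_color(val):
    # Show survival rates below 0.4 in red.
    color = 'red' if val < 0.4 else 'black'
    return 'color:%s' % color
# On pandas >= 2.1, Styler.map replaces the deprecated Styler.applymap.
train_pivot = train_pivot.style.applymap(get_color)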
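
To see what the pivot table and the 0.4 threshold pick out without rendering the styled table, here is a small sketch on made-up data. The rows are hypothetical and the index is reduced to Familysize alone to keep the output short.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the post.
toy = pd.DataFrame({
    'Familysize': [1, 1, 1, 1, 2, 2, 2, 2],
    'Sex':        [0, 0, 1, 1, 0, 0, 1, 1],
    'Survived':   [1, 1, 0, 0, 1, 0, 1, 0],
})

# Mean survival rate per (Familysize, Sex) cell.
pivot = pd.pivot_table(toy, index='Familysize', columns='Sex',
                       values='Survived', aggfunc='mean')

# Boolean mask of the cells that the 0.4 threshold would colour red.
low = pivot < 0.4

print(pivot)
print(low)
```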

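# Draw a grid of histograms that share axes: one row per Pclass, one column per Survived
# (seaborn >= 0.9 uses height=; older versions used size=)
grid = sns.FacetGrid(train, row='Pclass', col='Survived', height=2, aspect=3)
grid.map(plt.hist, 'Age_level')
grid.add_legend()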

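From the analysis above:

1) Men with Familysize 1 or 2 all have survival rates below 0.4;

2) Passengers travelling alone in 2nd or 3rd class have survival rates below 0.4;

3) Among 1st class passengers, those aged 14 to 40 account for a relatively large number of survivors;

4) Among 3rd class passengers, those aged 14 to 30 account for the largest number of deaths.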
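
The faceted histograms show counts, so conclusions 3) and 4) can be cross-checked against a plain count table. A minimal sketch using pd.crosstab on hypothetical rows (only the column names are from this post):

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the post.
toy = pd.DataFrame({
    'Pclass':    [1, 1, 1, 3, 3, 3, 3],
    'Survived':  [1, 1, 0, 0, 0, 0, 1],
    'Age_level': [2, 3, 2, 2, 2, 3, 1],
})

# Rows: (Pclass, Survived) pairs, matching the facets of the grid.
# Columns: Age_level, matching the histogram bins.
counts = pd.crosstab([toy['Pclass'], toy['Survived']], toy['Age_level'])

print(counts)
```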

f,(axis1,axis2,axis3)=plt.subplots(1,3,figsize=(18,8))
sns.violinplot(x='Pclass',y='Fare_log',hue='Survived',data=train,ax=axis1)
axis1.set_title('Pclass vs Fare Survival comparison')
sns.violinplot(x='Pclass',y='Age_level',hue='Survived',data=train,ax=axis2)
axis2.set_title('Pclass vs Age_level Survival comparison')
sns.violinplot(x='Pclass',y='Familysize',hue='Survived',data=train,ax=axis3)
axis3.set_title('Pclass vs Familysize Survival comparison')

f, axis = plt.subplots(2,3,figsize=(18,8))
sns.barplot(x='Embarked',y='Survived',data=train,ax=axis[0,0])
sns.barplot(x='Pclass',y='Survived',data=train,ax=axis[0,1],order=[1,2,3])
sns.barplot(x='Deck',y='Survived',data=train,ax=axis[0,2],order=[0,1,2])

sns.pointplot(x='Fare_log',y='Survived',data=train,ax=axis[1,0])
sns.pointplot(x='Age_level',y='Survived',data=train,ax=axis[1,1])
sns.pointplot(x='Familysize',y='Survived',data=train,ax=axis[1,2])

b. Dropping irrelevant features

drop_elements = ['PassengerId','Name','Age', 'Ticket','Fare','Cabin', 'isAlone', 'boy']
train=train.drop(drop_elements,axis=1)
test=test.drop(drop_elements,axis=1)
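
Note that drop raises a KeyError if any listed column is missing, for example when the cell is run a second time. One possible way around this, shown on a hypothetical frame, is the errors='ignore' option of DataFrame.drop; whether silently skipping missing columns is desirable depends on the workflow.

```python
import pandas as pd

# Hypothetical toy frame; 'boy' is deliberately absent.
toy = pd.DataFrame({'PassengerId': [1, 2], 'Name': ['a', 'b'], 'Pclass': [3, 1]})
drop_elements = ['PassengerId', 'Name', 'boy']

# errors='ignore' skips labels that are not present instead of raising KeyError.
slim = toy.drop(columns=drop_elements, errors='ignore')

print(slim.columns.tolist())
```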

Pearson correlation heatmap

colormap = plt.cm.RdBu
plt.figure(figsize=(14,12))
plt.title('Pearson Correlation of Features',y=1.05,size=15)
sns.heatmap(train.astype(float).corr(),linewidths=0.1,vmax=1.0,square=True,cmap=colormap,linecolor='white',annot=True)

A correlation coefficient with an absolute value between 0.5 and 0.7 is usually regarded as a moderately strong correlation.
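
To make the heatmap numbers concrete, the sketch below computes a Pearson coefficient by hand, compares it with DataFrame.corr, and lists the feature pairs whose absolute correlation falls in the 0.5 to 0.7 band. The data and the helper name pearson are made up for illustration; only the column names come from this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the post.
toy = pd.DataFrame({
    'Survived': [1, 0, 1, 0, 1, 0],
    'Sex':      [0, 1, 0, 1, 1, 0],
    'Pclass':   [1, 3, 1, 2, 3, 3],
})

def pearson(x, y):
    """Pearson r: covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

corr = toy.corr()  # DataFrame.corr uses Pearson by default
r_manual = pearson(toy['Survived'], toy['Pclass'])

# Visit each feature pair once and keep those with 0.5 <= |r| <= 0.7.
cols = list(corr.columns)
moderate = [(a, b, float(corr.loc[a, b]))
            for i, a in enumerate(cols) for b in cols[i + 1:]
            if 0.5 <= abs(corr.loc[a, b]) <= 0.7]

print(corr)
print(r_manual)
print(moderate)
```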

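# seaborn >= 0.9 uses height= (formerly size=); seaborn >= 0.11 uses fill= for KDE shading (formerly shade=)
g = sns.pairplot(train[['Survived','Pclass','Sex','Age_level','Fare_log','Familysize','Title']],hue='Survived',palette='seismic',height=1.2,diag_kind='kde',diag_kws=dict(fill=True),plot_kws=dict(s=10))
# Hide the x-axis tick labels
g.set(xticklabels=[])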

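The scatter plots in the pairplot above show how pairs of factors jointly affect survival. For example, where the number of family members aboard is small and the ticket class is 1st or 2nd, the scatter points are all red, which suggests a higher survival rate when both conditions hold.

The next post will focus on model prediction.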

Reposted from blog.csdn.net/huangxiaoyun1900/article/details/82684843