Kaggle Titanic Data Exploration: Code and Notes

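I have recently started learning from Kaggle competitions. A good first exercise is predicting who survived the Titanic. Because the evacuation followed a certain order of priority (the old and the weak first, for example), it should be possible to infer whether a passenger survived from information such as age, sex, ticket fare, cabin class, and the number of family members aboard.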


First, import the basic libraries:
import pandas as pd  # data analysis
import numpy as np  # scientific computing
from pandas import Series,DataFrame

Load the training data and the matplotlib plotting library, then plot how many passengers survived and how many did not:
data_train = pd.read_csv("Train.csv")
import matplotlib.pyplot as plt
fig = plt.figure()
fig.set(alpha=0.2)
plt.subplot2grid((2,3),(0,0))
data_train.Survived.value_counts().plot(kind='bar')
plt.ylabel(u"people amount")
plt.title(u"people survival")

The number of passengers in each cabin class can be counted the same way:
plt.subplot2grid((2,3),(0,1))
data_train.Pclass.value_counts().plot(kind="bar")
plt.ylabel(u"people amount")
plt.title(u"people class")


plt.subplot2grid((2,3),(0,2))
plt.scatter(data_train.Survived,data_train.Age)
plt.ylabel(u"age")
plt.grid(True,which='major',axis='y')  # flag passed positionally; newer Matplotlib no longer accepts the 'b' keyword
plt.title(u"survival")

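The plots so far show how convenient and flexible Python is: each one takes only a few lines.

Next, look at the age distribution within each cabin class (kind='kde' needs SciPy to be installed):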


plt.subplot2grid((2,3),(1,0),colspan=2)
data_train.Age[data_train.Pclass ==1].plot(kind='kde')
data_train.Age[data_train.Pclass ==2].plot(kind='kde')
data_train.Age[data_train.Pclass ==3].plot(kind='kde')
plt.xlabel(u'age')
plt.ylabel(u'density')
plt.title(u'age segment')
plt.legend((u'first',u'second',u'third'),loc='best')



The number of passengers per port of embarkation, again using value_counts and plot:


plt.subplot2grid((2,3),(1,2))
data_train.Embarked.value_counts().plot(kind='bar')
plt.title(u'embarked count')
plt.ylabel(u'count')
plt.show()
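
The bar charts above show raw counts; the same information can also be read off as numbers. The following is a minimal sketch, assuming a small hand-made DataFrame named `toy` (hypothetical values, not the real training file) that reuses the column names from the code above:

```python
import pandas as pd

# Hypothetical toy data standing in for the training file (values are made up).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 0],
    "Pclass":   [3, 1, 2, 3, 3, 1, 2, 3],
    "Embarked": ["S", "C", "S", "S", "Q", "C", "S", None],
})

# Share of each outcome instead of raw counts.
survival_share = toy["Survived"].value_counts(normalize=True).sort_index()

# Passenger count per class, ordered by class number rather than by frequency.
class_counts = toy["Pclass"].value_counts().sort_index()

# value_counts drops missing values by default; dropna=False keeps them visible.
embarked_counts = toy["Embarked"].value_counts(dropna=False)

print(survival_share)
print(class_counts)
print(embarked_counts)
```

Calling sort_index() matters when the result feeds a plot with hand-written tick labels, because value_counts otherwise orders the entries by frequency.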




fig=plt.figure()
fig.set(alpha=0.2)

Now look at how cabin class affects survival, splitting the passengers by outcome first:


Survived_0 = data_train.Pclass[data_train.Survived == 0].value_counts()
Survived_1 = data_train.Pclass[data_train.Survived == 1].value_counts()
df=pd.DataFrame({u'yes':Survived_1,u'no':Survived_0})
df.plot(kind='bar',stacked=True,ax=fig.gca())  # draw on the figure created above instead of opening a new one
plt.title(u'every class')
plt.xlabel(u'class')
plt.ylabel(u'count')
plt.show()
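
The stacked bars compare counts, but classes of different sizes are easier to compare as proportions. A possible sketch using pd.crosstab follows; the DataFrame `toy` and its values are made up for illustration and only the column names come from the code above:

```python
import pandas as pd

# Hypothetical toy data (made-up values) with the same column names as the training file.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Rows: class, columns: outcome (0 = no, 1 = yes); the same table the stacked bars draw.
counts = pd.crosstab(toy["Pclass"], toy["Survived"])

# normalize="index" divides each row by its total, giving per-class proportions.
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")

print(counts)
print(rates.round(2))
```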

Same as above, this time looking at the effect of sex:
fig=plt.figure()
fig.set(alpha=0.2)
Survived_m = data_train.Survived[data_train.Sex == 'male'].value_counts()
Survived_f = data_train.Survived[data_train.Sex == 'female'].value_counts()
df=pd.DataFrame({u'man':Survived_m,u'woman':Survived_f})
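df.plot(kind='bar',stacked=True,ax=fig.gca())  # draw on the figure created above instead of opening a new one
plt.title(u'survival by sex')
plt.xlabel(u'survived (0 = no, 1 = yes)')  # the x axis holds the Survived values, the colors hold the sex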
plt.ylabel(u'count')

plt.show() 


Here two features are combined: sex and cabin class are examined together.



fig=plt.figure()
fig.set(alpha=0.65)
fig.suptitle(u'class and sex')  # figure-level title; plt.title here would create an extra axes

# "high" means class 1 or 2, "low" means class 3.
# sort_index() fixes the bar order to 0 (no), 1 (yes), so the tick labels always match the bars.

ax1=fig.add_subplot(141)
data_train.Survived[(data_train.Sex == 'female') & (data_train.Pclass != 3)].value_counts().sort_index().plot(kind='bar',ax=ax1,label='female high',color='#FA2479')
ax1.set_xticklabels([u'no',u'yes'],rotation=0)
ax1.legend([u'female / high class'],loc='best')


ax2=fig.add_subplot(142,sharey=ax1)
data_train.Survived[(data_train.Sex == 'female') & (data_train.Pclass == 3)].value_counts().sort_index().plot(kind='bar',ax=ax2,label='female low',color='pink')
ax2.set_xticklabels([u'no',u'yes'],rotation=0)
ax2.legend([u'female / low class'],loc='best')


ax3=fig.add_subplot(143,sharey=ax1)
data_train.Survived[(data_train.Sex == 'male') & (data_train.Pclass != 3)].value_counts().sort_index().plot(kind='bar',ax=ax3,label='male high',color='lightblue')
ax3.set_xticklabels([u'no',u'yes'],rotation=0)
ax3.legend([u'male / high class'],loc='best')


ax4=fig.add_subplot(144,sharey=ax1)
data_train.Survived[(data_train.Sex == 'male') & (data_train.Pclass == 3)].value_counts().sort_index().plot(kind='bar',ax=ax4,label='male low',color='steelblue')
ax4.set_xticklabels([u'no',u'yes'],rotation=0)
ax4.legend([u'male / low class'],loc='best')
plt.show()
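
The four panels can also be condensed into one table of survival rates. The sketch below is one way to do it with pivot_table; the DataFrame `toy`, its values, and the helper column `ClassGroup` are assumptions made for illustration and do not appear in the original data:

```python
import pandas as pd

# Hypothetical toy data (made-up values) with the same column names as the training file.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Pclass":   [1, 2, 3, 3, 1, 2, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Same grouping as the four panels: class 1-2 ("high") versus class 3 ("low").
# ClassGroup is a hypothetical helper column, not part of the dataset.
toy["ClassGroup"] = toy["Pclass"].map(lambda c: "low" if c == 3 else "high")

# The mean of a 0/1 column is the fraction of survivors in each cell.
rate_table = toy.pivot_table(index="Sex", columns="ClassGroup",
                             values="Survived", aggfunc="mean")
print(rate_table)
```

A table like this makes the interaction visible at a glance, which is the point of combining the two features.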

Next, group the passengers by the number of relatives aboard and by survival with groupby, to see whether the two are related:

g=data_train.groupby(['SibSp','Survived']) 
df = pd.DataFrame(g.count()['PassengerId'])
print (df)


g=data_train.groupby(['Parch','Survived'])
df = pd.DataFrame(g.count()['PassengerId'])
print (df)
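
The printed result lists one row per (relatives, outcome) pair, which is hard to scan. A sketch of a wider layout follows, assuming a made-up DataFrame `toy` with the same column names; the `rate` column is added here only for illustration:

```python
import pandas as pd

# Hypothetical toy data (made-up values) with the same column names as the training file.
toy = pd.DataFrame({
    "PassengerId": range(1, 9),
    "SibSp":    [0, 0, 0, 1, 1, 1, 2, 2],
    "Survived": [0, 0, 1, 1, 1, 0, 0, 0],
})

# size() counts rows per (SibSp, Survived) pair; unstack moves Survived into columns.
# fill_value=0 covers combinations that never occur in the data.
counts = toy.groupby(["SibSp", "Survived"]).size().unstack(fill_value=0)

# Fraction of survivors for each SibSp value.
counts["rate"] = counts[1] / (counts[0] + counts[1])
print(counts)
```

The same pattern applies to Parch by swapping the column name.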

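Looking at the data, we also find that some passengers have no Cabin value, so we can check whether having a Cabin record correlates with survival.


The isnull and notnull functions select the two groups, and transpose swaps the rows and columns of the resulting table.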

fig=plt.figure()
fig.set(alpha=0.2)
Survived_cabin = data_train.Survived[pd.notnull(data_train.Cabin)].value_counts()
Survived_nocabin = data_train.Survived[pd.isnull(data_train.Cabin)].value_counts()
df=pd.DataFrame({u'have':Survived_cabin,u'have not':Survived_nocabin}).transpose()
df.plot(kind='bar',stacked=True,ax=fig.gca())  # draw on the figure created above instead of opening a new one
plt.title(u'whether have cabin')
plt.xlabel(u'have or not')
plt.ylabel(u'count')

plt.show()
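
To make the role of transpose concrete, here is a small sketch on made-up data. The DataFrame `toy` and its values are hypothetical; notna and isna are the Series-method counterparts of pd.notnull and pd.isnull:

```python
import pandas as pd

# Hypothetical toy data (made-up values) with the same column names as the training file.
toy = pd.DataFrame({
    "Cabin":    ["C85", None, None, "E46", None, None],
    "Survived": [1, 0, 0, 1, 1, 0],
})

# notna() is True where a cabin is recorded; isna() is its complement.
has_cabin = toy["Cabin"].notna()
print("missing Cabin values:", toy["Cabin"].isna().sum())

with_cabin = toy.loc[has_cabin, "Survived"].value_counts()
without_cabin = toy.loc[~has_cabin, "Survived"].value_counts()

# Built this way, rows are the outcome (0/1) and columns are the two groups.
# fillna(0) covers an outcome that is absent from one of the groups.
table = pd.DataFrame({"have": with_cabin, "have not": without_cabin}).fillna(0).astype(int)

# transpose() swaps rows and columns, so each group becomes one stacked bar.
flipped = table.transpose()
print(flipped)
```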

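That is essentially the end of the data exploration. The observations above can guide further feature combination and model building; correctly spotting the relationships in the data is a key step toward a good model.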





Reposted from blog.csdn.net/a5139515/article/details/79625488