Titanic Survival Prediction in Practice (Part 1: Data Analysis)

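I. Preface

  I have always counted as an IT guy, but I had hardly touched cutting-edge IT; my work was always production-side IT. On a whim I started studying algorithms, and it has now been a bit over half a year: from basic Python through linear regression, logistic regression, SVM, clustering, and NLP, to neural networks such as CNN, RNN, and GAN. The field really is vast. Starting today I plan to share some of what I have learned, and the first topic is naturally the most basic one: Titanic survival prediction.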

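II. Background

  Background: The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. The tragedy shocked the international community and led to better safety regulations for ships. One reason the wreck cost so many lives was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some luck, some groups were more likely to survive than others, such as women, children, and the upper class. In this case study we use machine learning to predict which passengers survived.

  Data description:
    PassengerId: passenger ID
    Survived: survival flag (0 = died, 1 = survived)
    Pclass: cabin class
    Name: name
    Sex: sex
    Age: age
    SibSp: number of siblings/spouses aboard
    Parch: number of parents/children aboard
    Ticket: ticket
    Fare: ticket fare
    Cabin: cabin
    Embarked: port of embarkation

  About this walkthrough: this is a Kaggle competition problem. I worked with Python 3.7 + Anaconda3. The work has three main steps: data analysis, data cleaning, and modeling/prediction.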

III. Dataset and Code

  https://pan.baidu.com/s/1JuCWhOEgvAV6gocicddQ4A       Extraction code: 1t9w

IV. Hands-on

  a. Data analysis

    1. Import pandas, numpy, and matplotlib

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
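%matplotlib inline  # Jupyter magic; remove this line when running as a plain script
plt.rcParams['font.sans-serif'] = ['SimHei']  # so that Chinese labels render correctly
plt.rcParams['axes.unicode_minus'] = False  # so that the minus sign renders correctly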

    2. Load the data

data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')

    3. Look at the data as a whole

data_train.shape  # shape of the training set
data_test.shape  # shape of the test set
data_train.head(4)  # first few rows

     

data_train.info()  # row counts, missing values, and dtypes
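
The `info()` output lists non-null counts, so the missing values have to be worked out by subtraction. A minimal sketch of a more direct summary follows; the `toy` frame and its values are made up for illustration and stand in for `train.csv`, and the `missing` table is my own naming.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

# Count and share of missing values per column, worst column first
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "share": toy.isnull().mean(),
}).sort_values("n_missing", ascending=False)
print(missing)
```

On the real data the same two expressions can be applied to `data_train` directly.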

    

    4. Analyze each column on its own

    Pclass

data_train.Pclass.value_counts()  # distribution of cabin classes

data_train.Pclass.isnull().sum()  # check for missing values

#Plot to see whether it is related to survival
Survived_1 = data_train.Pclass[data_train.Survived == 1].value_counts()
Survived_0 = data_train.Pclass[data_train.Survived == 0].value_counts()
df_Pclass = pd.DataFrame({"Survived_1": Survived_1, "Survived_0": Survived_0})
df_Pclass.plot(kind='bar', stacked=True)
df_Pclass.plot(kind='bar', stacked=False)

#Takeaway: first-class passengers have a somewhat higher chance of surviving
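
The bar charts compare raw counts, which makes rates hard to read when the groups differ in size. As a rough sketch, the survival rate per class can be computed directly, since the mean of a 0/1 column is the share of ones. The `toy` frame below is made up for illustration and is not the Kaggle data.

```python
import pandas as pd

# Toy stand-in for data_train (hypothetical values)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column = fraction of survivors in each class
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_class)
```

The same `groupby(...)["Survived"].mean()` pattern works for Sex, SibSp, Parch, and Embarked.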

    Name

#I don't think there is much to analyze here

    Sex

data_train.Sex.value_counts()  # count of each category

data_train.Sex.isnull().sum()  # check for missing values

#Relationship between sex and survival
fig = plt.figure()
Survived_0 = data_train.Sex[data_train.Survived==0].value_counts()
Survived_1 = data_train.Sex[data_train.Survived==1].value_counts()
df_sex = pd.DataFrame({'Survived_0':Survived_0,'Survived_1':Survived_1})
df_sex.plot(kind='bar',stacked=True)
plt.show()

    Age

fig = plt.figure()
# Estimate the density of the ages themselves (not of the per-age counts), one curve per outcome
data_train.Age[data_train.Survived==0].plot(kind='kde', label='Survived_0')
data_train.Age[data_train.Survived==1].plot(kind='kde', label='Survived_1')
plt.xlabel('Age')
plt.legend()
plt.show()

#Look at it in bins
def get_age(age):
    if 0 < age <= 8:
        return 0
    if 8 < age <= 15:
        return 1
    if 15 < age <= 22:
        return 2
    if 22 < age <= 30:
        return 3
    if 30 < age <= 38:
        return 4
    if 38 < age <= 48:
        return 5
    if 48 < age <= 58:
        return 6
    if 58 < age:
        return 7
    return 8  # missing (NaN) or non-positive ages

data_train.Age = data_train.Age.apply(get_age)  # apply once; this overwrites Age with the bin code
fig = plt.figure()
Survived_0 = data_train.Age[data_train.Survived==0].value_counts()
Survived_1 = data_train.Age[data_train.Survived==1].value_counts()
df_sex = pd.DataFrame({'Survived_0':Survived_0,'Survived_1':Survived_1})
# df_sex.plot(kind='bar',stacked=True)
df_sex.plot(kind='bar',stacked=False)
# plt.scatter(Survived_0.index,Survived_0.values)
# plt.scatter(Survived_1.index,Survived_1.values)
plt.show()

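#Younger passengers survive at a higher rate, but it is not that simple: quite a few in the 28 to 38 range also survived, and after finer binning it turns out that young men in their twenties do not have a high survival rate either. Looks like one-hot encoding is needed

    SibSp (number of siblings/spouses aboard)

data_train.SibSp.value_counts()  # distribution of values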
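
The binning plus one-hot idea can also be sketched with pandas built-ins. This is an alternative I am swapping in for illustration, not the code used above: `pd.cut` with the same edges as `get_age` (right-inclusive intervals), missing ages sent to bin 8, and `pd.get_dummies` for the one-hot step. The `ages` series is made-up sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample ages, including one missing value
ages = pd.Series([4.0, 8.0, 19.0, 30.0, 62.0, np.nan], name="Age")

# Same edges as get_age: (0, 8], (8, 15], ..., (58, inf)
edges = [0, 8, 15, 22, 30, 38, 48, 58, np.inf]
age_bin = pd.cut(ages, bins=edges, labels=False)  # float codes, NaN where age is missing
age_bin = age_bin.fillna(8).astype(int)           # bin 8 = missing, as in get_age

# One indicator column per bin that actually occurs
age_onehot = pd.get_dummies(age_bin, prefix="Age")
print(age_bin.tolist())
print(age_onehot.columns.tolist())
```

Keeping the bin code in a separate variable, rather than overwriting `Age`, leaves the raw ages available for later plots.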

fig = plt.figure()

Survived_0 = data_train.SibSp[data_train.Survived ==0].value_counts()
Survived_1 = data_train.SibSp[data_train.Survived ==1].value_counts()
df_sibsp = pd.DataFrame({"Survived_0":Survived_0,'Survived_1':Survived_1})
df_sibsp.plot(kind='bar',stacked=True)
plt.show()

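#Passengers with none seem more likely to die, those with one or two have a higher chance of surviving, and those with more than three basically all died

    Parch (number of parents/children aboard)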

data_train.Parch.value_counts()
fig = plt.figure()
Survived_0 = data_train.Parch[data_train.Survived ==0].value_counts()
Survived_1 = data_train.Parch[data_train.Survived ==1].value_counts()
df_parch = pd.DataFrame({'Survived_0':Survived_0,'Survived_1':Survived_1})
df_parch.plot(kind = 'bar',stacked = True)
df_parch.plot(kind = 'bar',stacked = False)

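#Again, passengers with no parents or children aboard have the higher death rate; those with one or two fare better, while the few with larger values mostly died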

    Ticket

#The ticket: so many distinct values, a headache to look at, so I'm dropping it

    Fare

def get_fare(fare):
    if  fare<=8:
        return 0
    elif 8<fare<=14:
        return 1
    elif 14<fare<=30:
        return 2
    elif 30<fare<=60:
        return 3
    elif 60<fare:
        return 4
data_train.Fare = data_train.Fare.apply(get_fare)
fig = plt.figure()
Survived_0 = data_train.Fare[data_train.Survived ==0].value_counts()
Survived_1 = data_train.Fare[data_train.Survived ==1].value_counts()
df_fare = pd.DataFrame({'Survived_0':Survived_0,'Survived_1':Survived_1})
df_fare.plot(kind = 'bar')

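#From this we can be fairly confident that the more expensive the ticket, the more likely the passenger survived

    Cabin

data_train.Cabin.isnull().sum()
#This one also has a huge number of distinct values, is not numeric, is hard to bin, and is mostly missing. Still, we can check whether missing vs. not missing is related to survival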

data_train.loc[ (data_train.Cabin.notnull()), 'Cabin' ] = "Yes"
data_train.loc[ (data_train.Cabin.isnull()), 'Cabin' ] = "No"
Survived_0 = data_train.Cabin[data_train.Survived ==0].value_counts()
Survived_1 = data_train.Cabin[data_train.Survived ==1].value_counts()
df_fare = pd.DataFrame({'Survived_0':Survived_0,'Survived_1':Survived_1})
df_fare.plot(kind = 'bar',stacked=True)

#Seen this way, there does seem to be a bit of a pattern
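
The Yes/No recoding above overwrites `Cabin` in place. A small sketch of the same check that writes the flag to a new column instead, so the raw values stay available; the column name `HasCabin` and the `toy` data are my own assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_train (hypothetical values)
toy = pd.DataFrame({
    "Cabin":    [np.nan, "C85", np.nan, "E46"],
    "Survived": [0, 1, 0, 1],
})

# Flag whether a cabin was recorded, without touching the Cabin column itself
toy["HasCabin"] = toy["Cabin"].notnull().map({True: "Yes", False: "No"})

# Survival rate for passengers with and without a recorded cabin
rate = toy.groupby("HasCabin")["Survived"].mean()
print(rate)
```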

    Embarked

data_train.Embarked.value_counts()

Survived_0 = data_train.Embarked[data_train.Survived ==0].value_counts()
Survived_1 = data_train.Embarked[data_train.Survived ==1].value_counts()
df_fare = pd.DataFrame({'Survived_0':Survived_0,'Survived_1':Survived_1})
df_fare.plot(kind = 'bar',stacked=True)
df_fare.plot(kind = 'bar',stacked=False)

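#Port C is interesting: more than half seem to have survived. The other two show no clear pattern

    

That completes the single-column analysis. Next, let's see what combined analysis can show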

fig = plt.figure(figsize=(15,20))
# fig.set(alpha=0.2)
plt.subplot(5,3,1)
data_train.Survived.value_counts().plot(kind='bar')
plt.title("Survived(1 is Survived)") # puts a title on our graph
plt.ylabel("counts")

plt.subplot(5,3,2)
data_train.Pclass.value_counts().plot(kind='bar')
plt.title("Cabin class")
plt.ylabel('counts')

plt.subplot(5,3,3)
data_train.Sex.value_counts().plot(kind='bar')
plt.title('male or female')
plt.ylabel('counts')

plt.subplot(5,3,4)
data_train.Age.plot(kind='kde')  # density of the Age values themselves, not of their counts
plt.title('age')
plt.ylabel('density')

plt.subplot(5,3,5)
plt.scatter(data_train.Survived, data_train.Age)
plt.ylabel("Age")                         # sets the y axis label
plt.grid(True, which='major', axis='y') # formats the grid line style of our graphs
plt.title("Survival by age (1 = survived)")

plt.subplot(5,3,6)
data_train.Parch.value_counts().plot(kind='bar')
plt.title("Parents/children aboard (Parch)")
plt.ylabel("Passengers")

plt.subplot(5,3,7)
plt.scatter(data_train.Survived,data_train.Fare)
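plt.title("0: died, 1: survived")
plt.ylabel("Fare")

plt.subplot(5,3,8)
data_train.Fare.plot(kind='kde')  # density of the Fare values themselves
plt.title("Fare")
plt.ylabel("Density")

plt.subplot(5,3,9)  # next free slot; reusing 8 would replace the Fare panel
data_train.Embarked.value_counts().plot(kind='bar')
plt.title("Port of embarkation")
plt.ylabel("Passengers")

   First, check whether cabin class and sex are related: Pclass and Sex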

fig = plt.figure()
# Combine both conditions in one boolean mask instead of chaining two filters
p1_m = data_train.Sex[(data_train.Sex == 'male') & (data_train.Pclass == 1)].value_counts()
p2_m = data_train.Sex[(data_train.Sex == 'male') & (data_train.Pclass == 2)].value_counts()
p3_m = data_train.Sex[(data_train.Sex == 'male') & (data_train.Pclass == 3)].value_counts()
pd.DataFrame({'p1_m':p1_m,'p2_m':p2_m,'p3_m':p3_m}).plot(kind='bar')

p1_f = data_train.Sex[(data_train.Sex == 'female') & (data_train.Pclass == 1)].value_counts()
p2_f = data_train.Sex[(data_train.Sex == 'female') & (data_train.Pclass == 2)].value_counts()
p3_f = data_train.Sex[(data_train.Sex == 'female') & (data_train.Pclass == 3)].value_counts()
pd.DataFrame({'p1_f':p1_f,'p2_f':p2_f,'p3_f':p3_f}).plot(kind='bar')

#Overall there is not much of a relationship. Men are a bit more numerous in the higher classes and the proportions differ slightly, but it does not look very pronounced
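
For two categorical columns, a cross-tabulation gives the same information as the six filtered counts in one table. The sketch below uses `pd.crosstab` on a made-up `toy` frame (not the Kaggle data); `normalize="index"` turns each row into shares, which makes the sex ratio per class directly comparable.

```python
import pandas as pd

# Toy stand-in for data_train (hypothetical values)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex": ["male", "female", "male", "female", "male", "male", "male", "female"],
})

# Raw counts per (class, sex) pair
counts = pd.crosstab(toy["Pclass"], toy["Sex"])

# Row-normalized: share of each sex within a class
shares = pd.crosstab(toy["Pclass"], toy["Sex"], normalize="index")
print(counts)
print(shares)
```

`counts.plot(kind='bar')` would then draw both sexes side by side for each class in a single chart.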

    Relationship between cabin class and fare: Pclass and Fare

df_pf1 = data_train.Fare[data_train.Pclass==1].value_counts()
df_pf2 = data_train.Fare[data_train.Pclass==2].value_counts()
df_pf3 = data_train.Fare[data_train.Pclass==3].value_counts()
df_pf = pd.DataFrame({'df_pf1':df_pf1,'df_pf2':df_pf2,'df_pf3':df_pf3})
plt.scatter(df_pf1.index,df_pf1.values,c='r',marker='.')
# plt.subplot(1,3,2)
plt.scatter(df_pf2.index,df_pf2.values,c = 'y',marker='.')
# plt.subplot(1,3,3)
plt.scatter(df_pf3.index,df_pf3.values,c = 'k',marker='.')

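#First-class fares are high, while second- and third-class fares are low, with second class a bit higher than third. This matches expectations: a better cabin class goes with a higher fare

    Check whether cabin class is related to the port of embarkation, i.e. whether passengers from different places differ in wealth: Pclass and Embarked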
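
The scatter of fare counts can be backed up with one number per class. A minimal sketch, using the median because fares are heavily skewed; the `toy` frame holds made-up raw fares, and on the real data this should be run before `Fare` is overwritten with bin codes.

```python
import pandas as pd

# Toy stand-in with raw (unbinned) fares, hypothetical values
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare":   [80.0, 60.0, 26.0, 13.0, 8.0, 7.0],
})

# Median fare per class; the median is less sensitive to a few very expensive tickets
fare_by_class = toy.groupby("Pclass")["Fare"].median()
print(fare_by_class)
```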

fig = plt.figure(figsize=(10,8))
df_pm1 = data_train.Embarked[data_train.Pclass==1].value_counts()
df_pm2 = data_train.Embarked[data_train.Pclass==2].value_counts()
df_pm3 = data_train.Embarked[data_train.Pclass==3].value_counts()
df_pm = pd.DataFrame({'df_pm1':df_pm1,'df_pm2':df_pm2,'df_pm3':df_pm3})
df_pm.plot(kind='bar')

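#From this, passengers who boarded at C seem to be mostly first class, Q is mostly third class, and S looks fairly typical

    Check whether cabin class is related to age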

fig = plt.figure(figsize=(18,8))

df_pa1 = data_train.Age[data_train.Pclass==1].value_counts()
df_pa2 = data_train.Age[data_train.Pclass==2].value_counts()
df_pa3 = data_train.Age[data_train.Pclass==3].value_counts()
df_pa = pd.DataFrame({'df_pa1':df_pa1,'df_pa2':df_pa2,'df_pa3':df_pa3})
# plt.subplot(1,3,1)
plt.scatter(df_pa1.index,df_pa1.values,c='r',marker='.')
# plt.subplot(1,3,2)
plt.scatter(df_pa2.index,df_pa2.values,c = 'y',marker='.')
# plt.subplot(1,3,3)
plt.scatter(df_pa3.index,df_pa3.values,c = 'k',marker='.')

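#The plot is ugly, but you can still just make out that the black points, i.e. third class, are heavily concentrated among young people aged 18 to 25

    Age and Sex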

df_as1 = data_train.Age[data_train.Sex=='male'].value_counts()
df_as2 = data_train.Age[data_train.Sex=='female'].value_counts()
# plt.subplot(1,3,1)
plt.scatter(df_as1.index,df_as1.values,c='r',marker='.')
# plt.subplot(1,3,2)
plt.scatter(df_as2.index,df_as2.values,c = 'k',marker='.')

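#Judging by the distributions, women are not over-represented in any age band, such as the teenage or the middle-aged band. In particular, the unusual survival in the middle-aged band is probably not caused by sex

    Sex and Fare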

df_sf1 = data_train.Fare[data_train.Sex=='male'].value_counts()
df_sf2 = data_train.Fare[data_train.Sex=='female'].value_counts()
# plt.subplot(1,3,1)
plt.scatter(df_sf1.index,df_sf1.values,c='r',marker='.')
# plt.subplot(1,3,2)
plt.scatter(df_sf2.index,df_sf2.values,c = 'k',marker='.')

#From this it looks like women bought more expensive tickets, but as with cabin class, the effect is not very pronounced

    Sex and Embarked

df_sm1 = data_train.Embarked[data_train.Sex=='male'].value_counts()
df_sm2 = data_train.Embarked[data_train.Sex=='female'].value_counts()
pd.DataFrame({"df_sm1":df_sm1,"df_sm2":df_sm2}).plot(kind='bar')

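    Earlier we saw that passengers with few siblings/spouses and few parents/children had a higher chance of surviving, so let's combine the two into a family feature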

data_train['family'] = data_train.SibSp + data_train.Parch
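df_f0 = data_train.family[data_train.Survived==0].value_counts()
df_f1 = data_train.family[data_train.Survived==1].value_counts()
pd.DataFrame({'df_f0':df_f0,'df_f1':df_f1}).plot(kind='bar',sharey=True)

#With 1, 2, or 3 family members the chance of surviving is highest, followed by 0. So the truly solo travellers, with no one at all aboard, come off worse; I'm speechless

Now let's summarize the analysis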
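
To read the family effect as rates rather than bar heights, the group size and survival rate can be tabulated together; small groups then stand out as unreliable. This is a sketch on a made-up `toy` frame, not the Kaggle data, and `summary` is my own naming.

```python
import pandas as pd

# Toy stand-in for data_train (hypothetical values)
toy = pd.DataFrame({
    "SibSp":    [0, 1, 1, 0, 4, 0],
    "Parch":    [0, 0, 2, 0, 2, 0],
    "Survived": [0, 1, 1, 0, 0, 1],
})
toy["family"] = toy["SibSp"] + toy["Parch"]

# count = passengers with that family size, mean = their survival rate
summary = toy.groupby("family")["Survived"].agg(["count", "mean"])
print(summary)
```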

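1. Pclass: the better the cabin class, the more likely the passenger survived. The relationship looks linear, so no extra dimensions are needed
2. Sex: women were more likely to survive
3. Age: young children survive more easily, probably because of their age. Few of the 15-25 group survived, possibly because of their tickets (they bought lower-class cabins), whereas middle-aged passengers had money, bought higher classes, and quite a few of them survived. Looks like this one needs to be expanded into more dimensions
4. SibSp: the survival chance is highest with one or two, followed by 0. Expand this into more dimensions too
5. Fare: society has apparently always worked this way; the rich have a higher chance of surviving. Positively correlated, so it only needs normalization
6. Cabin: too many missing values; I don't plan to keep it
7. Embarked: passengers from port C survive at a somewhat higher rate. Taken together, this is probably because C has many first-class passengers and a fair share of women. Can it be dropped?
8. family: an extra feature I added. The survival chance is highest with 1, 2, or 3 family members, followed by 0, which looks very similar to what the two summed columns showed on their own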
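
As a rough sketch of where this summary points (the actual cleaning is a separate step), the decisions above could translate into something like the following: drop Cabin, encode Sex as 0/1, min-max normalize Fare, and one-hot encode a count column. The `toy` frame, the choice of min-max scaling, and the use of `family` as the one-hot example are all my own assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for data_train with raw fares (hypothetical values)
toy = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex":    ["female", "male", "male"],
    "Fare":   [80.0, 8.0, 26.0],
    "Cabin":  ["C85", None, None],
    "family": [1, 0, 2],
})

features = toy.drop(columns=["Cabin"])                        # item 6: too many missing values
features["Sex"] = (features["Sex"] == "female").astype(int)   # item 2: 1 = female, 0 = male

# Item 5: min-max normalization of Fare into [0, 1]
fare = features["Fare"]
features["Fare"] = (fare - fare.min()) / (fare.max() - fare.min())

# Items 4 and 8: expand a count column into one indicator column per value
features = pd.get_dummies(features, columns=["family"], prefix="family")
print(features)
```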


Reposted from www.cnblogs.com/rgzngf/p/12231404.html