## Xiaodai Learns Data Analysis: Predicting Survival in the Titanic Disaster

#### 1. Analyzing the Problem

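``````
import pandas as pd

# Assumption: train.csv sits in the same folder as the test.csv used in section 3.
df = pd.read_csv(r'h:\dataanalysis\titanic\train.csv')

# Print every column without wrapping
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', None)
print(df)
print(df.columns)
print(df.shape)
``````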

``````
      PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket      Fare        Cabin Embarked
0              1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171    7.2500          NaN        S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599   71.2833          C85        C
..           ...       ...     ...                                                ...     ...   ...    ...    ...               ...       ...          ...      ...
889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0            111369   30.0000         C148        C
890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0            370376    7.7500          NaN        Q

[891 rows x 12 columns]
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
(891, 12)
``````

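``````
1. PassengerId: passenger ID
2. Survived: whether the passenger survived (0 = no, 1 = yes)
3. Pclass: ticket class (1 = first, 2 = second, 3 = third)
4. Name: passenger name
5. Sex: passenger sex
6. Age: passenger age
7. SibSp: number of siblings/spouses aboard
8. Parch: number of parents/children aboard
9. Ticket: ticket number
10. Fare: ticket fare
11. Cabin: cabin number
12. Embarked: port of embarkation
``````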
##### 1.1 Analysis of Influencing Factors

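1. Name column: Xiaodai is no fortune teller, so a passenger's name is assumed to have nothing whatsoever to do with survival.
2. Ticket column: Xiaodai does not believe in fate either, so the ticket number is assumed to be unrelated to survival (the Ticket values also show no obvious pattern, which is another reason to drop the column).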

``````
# Drop the two columns judged irrelevant above
df2 = df.drop(['Name', 'Ticket'], axis=1)
print(df2)
``````

``````
     PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch     Fare Cabin Embarked
0              1         0       3    male  22.0      1      0   7.2500   NaN        S
1              2         1       1  female  38.0      1      0  71.2833   C85        C
2              3         1       3  female  26.0      0      0   7.9250   NaN        S
3              4         1       1  female  35.0      1      0  53.1000  C123        S
4              5         0       3    male  35.0      0      0   8.0500   NaN        S
..           ...       ...     ...     ...   ...    ...    ...      ...   ...      ...
886          887         0       2    male  27.0      0      0  13.0000   NaN        S
887          888         1       1  female  19.0      0      0  30.0000   B42        S
888          889         0       3  female   NaN      1      2  23.4500   NaN        S
889          890         1       1    male  26.0      0      0  30.0000  C148        C
890          891         0       3    male  32.0      0      0   7.7500   NaN        Q

[891 rows x 10 columns]
``````

###### 1.1.1 Sex and Passenger Class

``````
import numpy as np
import matplotlib.pyplot as plt

# Number of survivors in first, second and third class
survived_class = []
survived_class.append(df2[df2['Pclass'] < 2]['Survived'].sum())
survived_class.append(df2[df2['Pclass'] == 2]['Survived'].sum())
survived_class.append(df2[df2['Pclass'] > 2]['Survived'].sum())

# Number of survivors by sex
survived_sex = []
survived_sex.append(df2[df2['Sex'] == 'female']['Survived'].sum())
survived_sex.append(df2[df2['Sex'] == 'male']['Survived'].sum())

ind1 = np.arange(3)
fig = plt.figure('Data Analysis: Titanic Disaster')
ax_class = plt.subplot(1, 2, 1)
ax_class.bar(ind1, survived_class, 0.3)
plt.xticks(ind1, ('First Class', 'Second Class', 'Third Class'))
plt.title('Survived passengers by class')

ind2 = np.arange(2)
ax_sex = plt.subplot(1, 2, 2)
ax_sex.bar(ind2, survived_sex, 0.3)
plt.xticks(ind2, ('Female', 'Male'))
plt.title('Survived passengers by sex')

plt.show()
``````

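1. Passenger counts aboard the Titanic: third class > first class > second class, and women < men.
2. Final survival rate: first class > second class > third class, and women > men.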
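The bar charts above show survivor counts, whereas the second observation is about survival rates. A minimal sketch of how such rates could be computed with `groupby` follows; `survival_rate` and `toy_df` are illustrative names that do not appear in the original code, and the rows are made up rather than taken from train.csv.

```python
import pandas as pd

def survival_rate(frame, column):
    """Return the fraction of survivors for each value of `column`."""
    # Survived is 0/1, so its mean within a group is the survival rate.
    return frame.groupby(column)['Survived'].mean()

# Toy data for illustration only; these are NOT rows from train.csv.
toy_df = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Sex':      ['female', 'male', 'female', 'male',
                 'female', 'male', 'male', 'male'],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

rate_by_class = survival_rate(toy_df, 'Pclass')
rate_by_sex = survival_rate(toy_df, 'Sex')
print(rate_by_class)
print(rate_by_sex)
```

On the real data the same helper would presumably be called as `survival_rate(df2, 'Pclass')` and `survival_rate(df2, 'Sex')`.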

###### 1.1.2 Age

``````
print(df2[df2['Age'].isna()])
``````

``````
5              6         0       3    male  NaN      0      0           330877    8.4583    NaN
17            18         1       2    male  NaN      0      0           244373   13.0000    NaN
19            20         1       3  female  NaN      0      0             2649    7.2250    NaN
26            27         0       3    male  NaN      0      0             2631    7.2250    NaN
..           ...       ...     ...     ...  ...    ...    ...              ...       ...    ...
863          864         0       3  female  NaN      8      2         CA. 2343   69.5500    NaN
868          869         0       3    male  NaN      0      0           345777    9.5000    NaN
878          879         0       3    male  NaN      0      0           349217    7.8958    NaN
888          889         0       3  female  NaN      1      2       W./C. 6607   23.4500    NaN

[177 rows x 10 columns]
``````
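The output above ends with 177 rows, so Age is missing for 177 of the 891 passengers (about 20%). Rather than printing the rows, the number and share of missing values per column can be summarized directly. This is only a sketch: `missing_summary` is a hypothetical helper and the small table is invented for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Count missing values per column and their share of all rows."""
    counts = frame.isna().sum()
    percent = counts / len(frame) * 100
    return pd.DataFrame({'missing': counts, 'percent': percent})

# Toy data for illustration only; these are NOT rows from train.csv.
toy_df = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare':  [7.25, 71.2833, 7.925, 8.05],
})

summary = missing_summary(toy_df)
print(summary)
```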

###### 1.1.3 Family Members

``````
# Survivors for each SibSp value (siblings/spouses aboard)
survived_sibsp = []
for loopi in range(0, df2.SibSp.max() + 1):
    survived_sibsp.append(df2[df2.SibSp == loopi]['Survived'].sum())

# Survivors for each Parch value (parents/children aboard)
survived_parch = []
for loopj in range(0, df2.Parch.max() + 1):
    survived_parch.append(df2[df2.Parch == loopj]['Survived'].sum())

ind4 = np.arange(len(survived_sibsp))
ax_sibsp = plt.subplot(1, 2, 1)
ax_sibsp.bar(ind4, survived_sibsp, 0.3)
plt.xticks(ind4, ind4)

ind5 = np.arange(len(survived_parch))
ax_parch = plt.subplot(1, 2, 2)
ax_parch.bar(ind5, survived_parch, 0.3)
plt.xticks(ind5, ind5)

plt.show()
``````

###### 1.1.4 Fare

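``````
# Assumption: the pie compares the mean fare of survivors and non-survivors
survived_fare = df2[df2.Survived == 1]['Fare'].mean()
died_fare = df2[df2.Survived == 0]['Fare'].mean()
fare = [survived_fare, died_fare]
labels = ['Survived', 'Not survived']
plt.pie(fare, labels=labels, autopct='%3.1f%%')
plt.title('Mean fare of survivors and non-survivors')
plt.show()
``````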

###### 1.1.5 Cabin Number

``````
print(df2[df2.Cabin.isna()])
``````

``````
     PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch     Fare Cabin Embarked
0              1         0       3    male  22.0      1      0   7.2500   NaN        S
2              3         1       3  female  26.0      0      0   7.9250   NaN        S
4              5         0       3    male  35.0      0      0   8.0500   NaN        S
5              6         0       3    male   NaN      0      0   8.4583   NaN        Q
7              8         0       3    male   2.0      3      1  21.0750   NaN        S
..           ...       ...     ...     ...   ...    ...    ...      ...   ...      ...
884          885         0       3    male  25.0      0      0   7.0500   NaN        S
885          886         0       3  female  39.0      0      5  29.1250   NaN        Q
886          887         0       2    male  27.0      0      0  13.0000   NaN        S
888          889         0       3  female   NaN      1      2  23.4500   NaN        S
890          891         0       3    male  32.0      0      0   7.7500   NaN        Q

[687 rows x 10 columns]
``````

``````
# Survivors with and without a recorded cabin
survived_cabin = []
survived_cabin.append(df2[~df2.Cabin.isna()]['Survived'].sum())
survived_cabin.append(df2[df2.Cabin.isna()]['Survived'].sum())
survived_cabin = np.array(survived_cabin)

# Assumed definition: passenger counts with and without a recorded cabin
cabin = np.array([df2.Cabin.notna().sum(), df2.Cabin.isna().sum()])
survived_cabin1 = survived_cabin / cabin * 100

ind6 = np.arange(len(survived_cabin))
ax_cabin = plt.subplot(1, 2, 1)
ax_cabin.bar(ind6, survived_cabin, 0.3)
plt.xticks(ind6, ('YES', 'NO'))
plt.title('Survived passengers by cabin record')

ax_cabin = plt.subplot(1, 2, 2)
ax_cabin.bar(ind6, survived_cabin1, 0.3)
plt.xticks(ind6, ('YES', 'NO'))
plt.title('Survival rate by cabin record')
plt.ylabel('Percentage %')
plt.show()
``````

###### 1.1.6 Port of Embarkation

``````
print(df[df.Embarked.isna()])
``````

``````
     PassengerId  Survived  Pclass                                       Name     Sex   Age  SibSp  Parch  Ticket  Fare Cabin Embarked
61            62         1       1                        Icard, Miss. Amelie  female  38.0      0      0  113572  80.0   B28      NaN
829          830         1       1  Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0      0      0  113572  80.0   B28      NaN
``````

``````
# Survivors for each port of embarkation
survived_embarked = []
survived_embarked.append(df2[df2.Embarked == 'S']['Survived'].sum())
survived_embarked.append(df2[df2.Embarked == 'C']['Survived'].sum())
survived_embarked.append(df2[df2.Embarked == 'Q']['Survived'].sum())

# Assumed definition: survivors as a percentage of the passengers from each port
embarked_total = np.array([(df2.Embarked == port).sum() for port in ('S', 'C', 'Q')])
survived_embarked1 = np.array(survived_embarked) / embarked_total * 100

ind5 = np.arange(len(survived_embarked))
ax_embarked = plt.subplot(1, 2, 1)
ax_embarked.bar(ind5, survived_embarked, 0.3)
plt.xticks(ind5, ('S', 'C', 'Q'))
plt.title('Survived passengers by port of embarkation')

ax_embarked = plt.subplot(1, 2, 2)
ax_embarked.bar(ind5, survived_embarked1, 0.3)
plt.xticks(ind5, ('S', 'C', 'Q'))
plt.title('Survival rate by port of embarkation')
plt.ylabel('Percentage %')
plt.show()
``````

##### 1.2 Feature Engineering

###### 1.2.1 Filling Missing Values

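``````
# Fill missing ages with the mean age
df2.loc[df2.Age.isna(), 'Age'] = df2.Age.mean()
# Fill the two missing ports with 'S', the most common port
df2.loc[df2.Embarked.isna(), 'Embarked'] = 'S'
``````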
###### 1.2.2 Data Scaling

``````
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler()
age = np.array(df2.Age).reshape(df2.shape[0], 1)
age_std = sc.fit_transform(age)

fare = np.array(df2.Fare).reshape(df2.shape[0], 1)
fare_std = sc.fit_transform(fare)
``````
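With its default settings, `MinMaxScaler` rescales each feature to the range [0, 1] using x' = (x - min) / (max - min). The sketch below reproduces that formula by hand on a few made-up ages so the effect is easy to see; `min_max_scale` is a hypothetical helper, not part of sklearn or of the original code.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0 here.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative values only, not taken from train.csv.
ages = [2.0, 22.0, 38.0, 62.0]
scaled = min_max_scale(ages)
print(scaled)
```

The smallest value maps to 0 and the largest to 1, which puts Age and Fare on a comparable scale for distance-based methods such as KNN.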
###### 1.2.3 One-Hot Encoding

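``````
# Reduce Cabin to a YES/NO flag and cap the family size, as in the feature function of section 1.3
df2['Cabin'] = np.where(df2.Cabin.isna(), 'NO', 'YES')
family = df2.SibSp + df2.Parch
family.loc[family > 3] = 4

dummies_cabin = pd.get_dummies(df2['Cabin'], prefix='Cabin')
dummies_sex = pd.get_dummies(df2.Sex, prefix='Sex')
dummies_family = pd.get_dummies(family, prefix='Family')
dummies_embarked = pd.get_dummies(df2.Embarked, prefix='Embarked')
dummies_pclass = pd.get_dummies(df2.Pclass, prefix='Pclass')
``````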
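To make the result of `pd.get_dummies` concrete, here is a small sketch on invented values. It also shows that, by default, a missing value gets no dummy column of its own and simply produces an all-zero row. The series below is illustrative only.

```python
import pandas as pd

# Toy values for illustration; None stands in for a missing port.
embarked = pd.Series(['S', 'C', None, 'Q'], name='Embarked')

dummies = pd.get_dummies(embarked, prefix='Embarked')
print(dummies)

# One column per category, named prefix_category and sorted by category.
print(list(dummies.columns))
```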
##### 1.3 Feature Vector

``````
def featrue(df):
    # Name and Ticket are treated as irrelevant (section 1.1)
    df2 = df.drop(['Name', 'Ticket'], axis=1)

    # Reduce Cabin to a YES/NO flag: is a cabin recorded or not
    df2.loc[~df2.Cabin.isna(), 'Cabin'] = 'YES'
    df2.loc[df2.Cabin.isna(), 'Cabin'] = 'NO'

    # Family size, with every value above 3 grouped into one bucket
    family = df2.SibSp + df2.Parch
    family.loc[family > 3] = 4

    dummies_cabin = pd.get_dummies(df2['Cabin'], prefix='Cabin')
    dummies_sex = pd.get_dummies(df2.Sex, prefix='Sex')
    dummies_family = pd.get_dummies(family, prefix='Family')
    dummies_embarked = pd.get_dummies(df2.Embarked, prefix='Embarked')
    dummies_pclass = pd.get_dummies(df2.Pclass, prefix='Pclass')

    # Fill missing Age and Fare values with the column mean
    df2.loc[df2.Age.isna(), 'Age'] = df2.Age.mean()
    age = np.array(df2.Age).reshape(df2.Age.shape[0], 1)
    df2.loc[df2.Fare.isna(), 'Fare'] = df2.Fare.mean()
    fare = np.array(df2.Fare).reshape(df2.Fare.shape[0], 1)

    # Scale Age and Fare to [0, 1]
    sc = MinMaxScaler()
    age_std = pd.DataFrame(sc.fit_transform(age), columns=['Age_std'], index=df.index.values)
    fare_std = pd.DataFrame(sc.fit_transform(fare), columns=['Fare_std'], index=df.index.values)

    df3 = pd.concat([df2, dummies_pclass, dummies_sex, dummies_family, dummies_cabin, dummies_embarked,
                     age_std, fare_std], axis=1)
    # errors='ignore': test.csv has no Survived column, and PassengerId may already be the index.
    # Dropping PassengerId is an assumption; it is an identifier rather than a feature.
    df3 = df3.drop(['PassengerId', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Cabin', 'Embarked', 'Age', 'Fare',
                    'Survived'], axis=1, errors='ignore')

    return df3
``````

#### 2. Training the Models

##### 2.1 Splitting into Training and Validation Sets

The model_selection module in sklearn provides a tool for splitting the data.

``````
from sklearn.model_selection import train_test_split
train1, test1, train_label1, test_label1 = train_test_split(df, df.Survived, test_size=0.4, random_state=20)
train_featrue = featrue(train1)
test_featrue = featrue(test1)
``````
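Because `featrue` is called separately on the training and validation parts, each call builds its own dummy columns. On this dataset both parts most likely contain every category, but that is not guaranteed: if one category were absent from the validation part, the two feature tables would have different columns. One possible safeguard, sketched on invented data, is to reindex the validation columns against the training columns. The names `train_part`, `valid_part` and `valid_aligned` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for one categorical feature in each split (not real data).
train_part = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
valid_part = pd.DataFrame({'Embarked': ['S', 'S', 'C']})  # no 'Q' here

train_dummies = pd.get_dummies(train_part['Embarked'], prefix='Embarked')
valid_dummies = pd.get_dummies(valid_part['Embarked'], prefix='Embarked')

# Force the validation columns to match the training columns; a category
# that is missing from the validation part becomes an all-zero column.
valid_aligned = valid_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(valid_aligned.columns))
```

The same idea applies to the scaler: fitting it on the training part and only transforming the validation part keeps the two on the same scale.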
##### 2.2 Results from Several Machine Learning Methods

``````
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Logistic regression
lg_clf = LogisticRegression(C=1)
lg_clf.fit(train_featrue, train_label1)
lg_score1 = lg_clf.score(train_featrue, train_label1)
lg_score2 = lg_clf.score(test_featrue, test_label1)
print('='*40)
print('Method: LogisticRegression')
print('score on train set:', lg_score1)
print('score on test set:', lg_score2)

# K-nearest neighbors
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(train_featrue, train_label1)
knn_score1 = knn_clf.score(train_featrue, train_label1)
knn_score2 = knn_clf.score(test_featrue, test_label1)
print('='*40)
print('Method: KNeighborsClassifier')
print('score on train set:', knn_score1)
print('score on test set:', knn_score2)

# Support vector machine with an RBF kernel
svm_clf = SVC(C=0.1, kernel='rbf', gamma='scale')
svm_clf.fit(train_featrue, train_label1)
svm_score1 = svm_clf.score(train_featrue, train_label1)
svm_score2 = svm_clf.score(test_featrue, test_label1)
print('='*40)
print('Method: Support Vector Machine')
print('Score on Train Set:', svm_score1)
print('Score on Test Set:', svm_score2)
``````

``````
========================================
Method: LogisticRegression
score on train set: 0.8164794007490637
score on test set: 0.8207282913165266
========================================
Method: KNeighborsClassifier
score on train set: 0.846441947565543
score on test set: 0.7478991596638656
========================================
Method: Support Vector Machine
Score on Train Set: 0.8389513108614233
Score on Test Set: 0.7899159663865546
``````

``````
========================================
Method: LogisticRegression
score on train set: 0.8164794007490637
score on test set: 0.8235294117647058
========================================
Method: KNeighborsClassifier
score on train set: 0.846441947565543
score on test set: 0.7478991596638656
========================================
Method: Support Vector Machine
Score on Train Set: 0.8071161048689138
Score on Test Set: 0.8123249299719888
``````
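For sklearn classifiers, `score` returns the mean accuracy, that is, the fraction of samples classified correctly. With `test_size=0.4` on 891 rows the split should come out as 534 training rows and 357 validation rows. That split is an inference from the printed scores rather than something stated above; under it, the first logistic regression result corresponds to 436 of 534 and 293 of 357 correct predictions. The sketch below shows the calculation; `accuracy` is a hypothetical helper and the label lists are invented.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels for illustration only: 4 of the 5 predictions are correct.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
toy_accuracy = accuracy(y_true, y_pred)
print(toy_accuracy)

# Counts inferred from the reported scores (assumed 534/357 split).
lg_train_score = 436 / 534   # logistic regression, training set
lg_valid_score = 293 / 357   # logistic regression, validation set, first run
print(lg_train_score, lg_valid_score)
```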

#### 3. Prediction

``````
predict_df = pd.read_csv(r'h:\dataanalysis\titanic\test.csv')
predict_df.set_index(['PassengerId'], inplace=True)

# Build the same features as for the training data
predict_featrue = featrue(predict_df)
#print(predict_featrue)

# Predict with each of the three trained models
lg_clf_predict = lg_clf.predict(predict_featrue)
knn_clf_predict = knn_clf.predict(predict_featrue)
svm_clf_predict = svm_clf.predict(predict_featrue)
lg_clf_predict = pd.DataFrame(lg_clf_predict, columns=['Survived'], index=predict_df.index)
knn_clf_predict = pd.DataFrame(knn_clf_predict, columns=['Survived'], index=predict_df.index)
svm_clf_predict = pd.DataFrame(svm_clf_predict, columns=['Survived'], index=predict_df.index)
#print(lg_clf_predict)

# Write one submission file per model
lg_clf_predict.to_csv(r'h:\dataanalysis\titanic\submission1.csv')
knn_clf_predict.to_csv(r'h:\dataanalysis\titanic\submission2.csv')
svm_clf_predict.to_csv(r'h:\dataanalysis\titanic\submission3.csv')
``````