Predicting Survival on the Titanic

Python Libraries

  • NumPy: scientific computing library for Python
  • Pandas: data analysis and manipulation library for Python
  • Scikit-learn: machine learning library for Python

The Dataset

The data used here is the training set from the Kaggle Titanic competition.

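The original post showed a screenshot of the data here. As a minimal sketch, the table below reproduces the competition's column layout with a hand-made two-row sample; in practice the real file would be loaded with `pd.read_csv('../res/train.csv')` and inspected the same way:

```python
import pandas as pd

# Hand-made two-row sample with the Kaggle competition's column names;
# with the real data this would be: sample = pd.read_csv('../res/train.csv')
sample = pd.DataFrame({
    'PassengerId': [1, 2],
    'Survived': [0, 1],
    'Pclass': [3, 1],
    'Sex': ['male', 'female'],
    'Age': [22.0, 38.0],
    'SibSp': [1, 1],
    'Parch': [0, 0],
    'Fare': [7.25, 71.2833],
    'Embarked': ['S', 'C'],
})
# dtypes shows how pandas parsed each column;
# describe() summarizes the numeric columns.
print(sample.dtypes)
print(sample.describe())
```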

数据预处理

Missing values are filled in: the NaNs in the Age column are replaced with the median age, and missing Embarked entries are filled with 'S', the most common port of embarkation.
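The screenshots showing the before/after missing-value counts are omitted here. A minimal sketch of the same idea, using a toy frame with one missing Age and one missing Embarked value:

```python
import pandas as pd
import numpy as np

# Toy frame mirroring the gaps in the real training set.
df = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0],
    'Embarked': ['S', 'C', None],
})
print(df.isnull().sum())  # count missing values per column

# Fill Age with the column median, Embarked with the most common port.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna('S')
print(df.isnull().sum())  # all zeros after filling
```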

Linear Regression Model

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time:    2020/1/3 9:42
# @Author:  Martin
# @File:    Titanic1.py
# @Software:PyCharm

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
# Load the data
titanic = pd.read_csv('../res/train.csv')
# Preprocessing: fill missing values and encode categorical columns as integers
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 0
titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 1
titanic['Embarked'] = titanic['Embarked'].fillna('S')
titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0
titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1
titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2
# Linear regression with 3-fold cross-validation
predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
alg = LinearRegression()
kf = KFold(n_splits=3, shuffle=False)
predictions = []
for train, test in kf.split(titanic[predictors]):
    # Fit on the training fold...
    train_predictors = titanic[predictors].iloc[train, :]
    train_target = titanic['Survived'].iloc[train]
    alg.fit(train_predictors, train_target)
    # ...and predict on the held-out fold
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)
# Evaluate accuracy: threshold the regression output at 0.5
predictions = np.concatenate(predictions, axis=0)
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
accuracy = sum(predictions == titanic['Survived']) / len(predictions)
print(accuracy)

The output:
0.7833894500561167

Random Forest Model

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time:    2020/1/3 12:16
# @Author:  Martin
# @File:    Titanic2.py
# @Software:PyCharm

from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection
import pandas as pd
# Load the data
titanic = pd.read_csv('../res/train.csv')
# Preprocessing: fill missing values and encode categorical columns as integers
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 0
titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 1
titanic['Embarked'] = titanic['Embarked'].fillna('S')
titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0
titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1
titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2
# Random forest, scored with 3-fold cross-validation
predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=2)
kf = model_selection.KFold(n_splits=3, shuffle=True, random_state=1)
scores = model_selection.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=kf)
print(scores.mean())

The output:
0.8260381593714926

Summary

Prediction accuracy could be improved further by:

  • Engineering additional features
  • Combining multiple algorithms (regression + random forest)
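A hedged sketch of the second idea: average the predicted probabilities of two classifiers and threshold at 0.5. The stand-in data from `make_classification` and the choice of logistic regression as the second model are illustrative assumptions, not from the original post; with the real data, `X` and `y` would be `titanic[predictors]` and `titanic['Survived']`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Illustrative stand-in data with the same number of features as `predictors`.
X, y = make_classification(n_samples=200, n_features=7, random_state=1)

rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)
lr = LogisticRegression(max_iter=1000).fit(X, y)

# Blend: average the two models' predicted probabilities, threshold at 0.5.
proba = (rf.predict_proba(X)[:, 1] + lr.predict_proba(X)[:, 1]) / 2
ensemble_pred = (proba > 0.5).astype(int)
print((ensemble_pred == y).mean())  # training accuracy of the blended model
```

For the first idea, a commonly added feature is family size, e.g. `titanic['FamilySize'] = titanic['SibSp'] + titanic['Parch'] + 1` (again an illustrative example, not from the post).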


Reposted from blog.csdn.net/Deep___Learning/article/details/103814669