[Kaggle Beginner Problem 1] Titanic: Machine Learning from Disaster

Original problem:

Start here if...

You're new to data science and machine learning, or looking for a simple intro to the Kaggle prediction competitions.

Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Practice Skills

Binary classification
Python and R basics

Training data:

Features in the training data:

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked

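Feature        Explanation
PassengerId    Passenger ID
Survived       0 = died, 1 = survived
Pclass         Economic class (1 = high, 2 = middle, 3 = low)
Name           Passenger name
Sex            Sex
Age            Age
SibSp          Number of siblings/spouses aboard
Parch          Number of parents/children aboard
Ticket         Ticket number
Fare           Fare
Cabin          Cabin number
Embarked       Port of embarkation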
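
Several of these columns cannot go into a model as they are: Age has gaps, and Sex and Embarked are strings. The following is a minimal sketch of one way to handle that, using a hand-made frame (the rows are invented for illustration and are not taken from train.csv). It assumes a median fill for Age and an arbitrary integer code for each category.

```python
import numpy as np
import pandas as pd

# Hand-made rows for illustration only; these are not records from train.csv.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 30.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "Q"],
})

# Fill the missing age with the median of the known ages.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Map the string categories to integer codes (the codes themselves are arbitrary).
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Embarked"] = toy["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})

print(toy)
```

Integer codes like these impose an artificial ordering on Embarked; that is usually harmless for tree models but can matter for linear ones.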

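Approach: load the samples -> compute summary statistics (count, mean, variance, and so on) -> fill missing values with a central value (the code uses the median of Age) -> ... -> cross-validation (only the training data is used: split it into three parts, train on two of them and test on the remaining one, rotate through all three, and average the scores; the original test set is not touched) -> linear regression -> logistic regression -> random forest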
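
The three-part rotation described above can be sketched with NumPy alone. This is an illustration under stated assumptions: `three_fold_indices` is a hypothetical helper, the labels are made up, and the "model" is a placeholder majority-class rule rather than any of the models used in this post.

```python
import numpy as np

def three_fold_indices(n_samples, n_folds=3):
    """Split range(n_samples) into contiguous folds and yield (train, test) index arrays."""
    folds = np.array_split(np.arange(n_samples), n_folds)
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one forms the training set.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

# Made-up labels for illustration (1 = survived, 0 = died).
y = np.array([0, 1, 1, 0, 0, 1, 0, 0, 1])

fold_scores = []
for train_idx, test_idx in three_fold_indices(len(y)):
    # Placeholder "model": always predict the most common label in the training folds.
    majority = int(y[train_idx].mean() >= 0.5)
    fold_scores.append(float((y[test_idx] == majority).mean()))

# The reported score is the average over the three held-out folds.
mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_score)
```

Each sample lands in exactly one test fold, so every row is predicted once by a model that never saw it during training.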

# coding=utf-8
import os
import pandas as pd

file_root = os.path.realpath('titanic')
file_name_test = os.path.join(file_root, "test.csv")
file_name_train = os.path.join(file_root, "train.csv")
# Show every column when printing a DataFrame
pd.set_option('display.max_columns', None)
titanic = pd.read_csv(file_name_train)
data = titanic.describe()

# Show which columns have missing values
titanic.info()
# Fill missing Age values with the median age
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
data = titanic.describe()
print(data)

# Inspect the values of Sex and replace them with integer codes
print("Original Sex values:", titanic['Sex'].unique())
titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1})
print("Sex values after replacement:", titanic['Sex'].unique())
# Inspect the values of Embarked, fill the gaps with 'S', and replace them with integer codes
print("Original Embarked values:", titanic['Embarked'].unique())
titanic['Embarked'] = titanic['Embarked'].fillna('S')
titanic['Embarked'] = titanic['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
print("Embarked values after replacement:", titanic['Embarked'].unique())

# Predict with a linear regression model
from sklearn.linear_model import LinearRegression
# Cross-validation utilities
from sklearn import model_selection
# Feature columns
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Initialize the model
alg = LinearRegression()
# titanic.shape is an (m, n) tuple, so titanic.shape[0] is the number of samples;
# n_splits sets the number of cross-validation folds
print("titanic.shape[0]:", titanic.shape[0])
# Older call, kept for reference:
# kf = model_selection.KFold(titanic.shape[0], n_folds=3, random_state=1)
# random_state is omitted because recent scikit-learn rejects it when shuffle=False
kf = model_selection.KFold(n_splits=3, shuffle=False)
predictions = []
# Loop over the three folds
for train, test in kf.split(titanic['Survived']):
    # Select the training rows
    train_predictors = titanic[predictors].iloc[train, :]
    # Target values for the training rows
    train_target = titanic['Survived'].iloc[train]
    # Fit the linear regression model
    alg.fit(train_predictors, train_target)
    # Predict on the held-out fold
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)

# Evaluate on the held-out folds; the regression outputs are expected to fall roughly in [0, 1]
import numpy as np
# np.concatenate((a1, a2, ...), axis=0) joins several arrays in a single call
predictions = np.concatenate(predictions, axis=0)

predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
# Accuracy is the fraction of predictions that match the true labels
accuracy = (predictions == titanic['Survived'].values).mean()
print("Linear regression model: ", accuracy)
# Output: 0.78...
# The same task with logistic regression
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings("ignore")
# Initialize the model
alg = LogisticRegression(random_state=1)
# Compute the 3-fold cross-validated scores
scores = model_selection.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=3)
print("Logistic regression model: ", scores.mean())

# Random forest: build many decision trees that decide together and combine their results.
# Each tree works with a random subset of these seven features.
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Parameters: random seed, number of trees, minimum samples to split a node, minimum samples per leaf
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=2)
# random_state is omitted because recent scikit-learn rejects it when shuffle=False
kf = model_selection.KFold(n_splits=3, shuffle=False)
scores = model_selection.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=kf)
print("Random forest: ", scores.mean())
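
The linear regression step in this post turns continuous outputs into class labels with a 0.5 threshold and then scores them against Survived. As a minimal sketch of that scoring logic, here it is on made-up numbers (the scores and labels are invented for illustration and are not results from the Titanic data):

```python
import numpy as np

# Made-up regression outputs and true labels, for illustration only.
raw_scores = np.array([0.91, 0.12, 0.55, 0.40, 0.73, 0.08])
y_true = np.array([1, 0, 0, 0, 1, 1])

# Scores above 0.5 become class 1, everything else becomes class 0.
y_pred = (raw_scores > 0.5).astype(int)

# Accuracy is the share of positions where prediction and label agree.
accuracy = float((y_pred == y_true).mean())
print(y_pred, accuracy)
```

Counting matches this way treats correct predictions of 0 and of 1 equally, which summing the matching prediction values would not do.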

Video link: https://study.163.com/course/courseLearn.htm?courseId=1003551009#/learn/video?lessonId=1004052091&courseId=1003551009