Titanic Survival Prediction

Meaning of the data fields:

  • PassengerId => passenger ID
  • Pclass => passenger class (1/2/3)
  • Name => passenger name
  • Sex => sex
  • Age => age
  • SibSp => number of siblings/spouses aboard
  • Parch => number of parents/children aboard
  • Ticket => ticket number
  • Fare => ticket fare
  • Cabin => cabin
  • Embarked => port of embarkation


Read in the data and summarize it with describe():

import pandas
titanic = pandas.read_csv('titanic_train.csv')
print(titanic.describe())
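As a quick supplementary check (not in the original post), the number of missing values in each column can also be counted directly:

print(titanic.isnull().sum())  # count of missing values per column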

Age turns out to have missing values; fill them with the median age:

titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
print(titanic.describe())

Convert the categorical columns, such as sex and port of embarkation, to numeric values:

print(titanic['Sex'].unique())
titanic.loc[titanic['Sex'] == 'male','Sex'] = 0
titanic.loc[titanic['Sex'] == 'female','Sex'] = 1
print(titanic['Embarked'].unique())
titanic['Embarked'] = titanic['Embarked'].fillna('S')
titanic.loc[titanic['Embarked'] == 'S','Embarked'] = 0
titanic.loc[titanic['Embarked'] == 'C','Embarked'] = 1
titanic.loc[titanic['Embarked'] == 'Q','Embarked'] = 2
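As an aside, the same encoding can be written more compactly with pandas' map, as an alternative to the loc assignments above rather than in addition to them; a minimal sketch assuming the freshly loaded titanic DataFrame:

titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1})
titanic['Embarked'] = titanic['Embarked'].fillna('S').map({'S': 0, 'C': 1, 'Q': 2})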

Use Survived (whether the passenger was rescued) as the label, bring in cross-validation, and analyze the features with a linear regression:

from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import KFold

predictors = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
alg = LinearRegression()
# use KFold to split the training set into 3 folds for cross-validation
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)
# build a regression model in each fold
predictions = []
for train, test in kf:
    # take the passenger feature columns of the training fold
    train_predictors = titanic[predictors].iloc[train,:]
    # take the survival outcomes of the training fold
    train_target = titanic['Survived'].iloc[train]
    # fit the linear regression to the data
    alg.fit(train_predictors, train_target)
    # predict on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test,:])
    # collect the results
    predictions.append(test_predictions)
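Note that the sklearn.cross_validation module used here was removed in scikit-learn 0.20; on a current install the same loop can be written with sklearn.model_selection (a sketch under that assumption):

from sklearn.model_selection import KFold

kf = KFold(n_splits=3)  # replaces KFold(titanic.shape[0], n_folds=3)
predictions = []
for train, test in kf.split(titanic[predictors]):
    # train and test are index arrays, used exactly as in the loop above
    alg.fit(titanic[predictors].iloc[train,:], titanic['Survived'].iloc[train])
    predictions.append(alg.predict(titanic[predictors].iloc[test,:]))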

Use numpy to concatenate the fold outputs (predicted survival probabilities), binarize them with 0.5 as the cutoff, and compare the predictions with the actual outcomes to get an accuracy:

import numpy as np
# join the per-fold result arrays into one
predictions = np.concatenate(predictions, axis=0)
# split the outputs in the 0-to-1 range into two classes, using 0.5 as the cutoff
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
# compare the predicted results with the true outcomes in the training set
accuracy = sum(predictions == titanic['Survived']) / len(predictions)
print(accuracy)

0.7833894500561167

Try the random forest method to see whether the accuracy improves:

from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

# feature set
predictors = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
# build a random forest with 10 trees; stop splitting at min_samples_split=2 or min_samples_leaf=1
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
# run cross-validation again
kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
# evaluate the model: random forest classifier, passenger features as data, survival as target, folds from kf
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=kf)
print(scores.mean())

0.7856341189674523
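Likewise, on scikit-learn 0.20 or later the cross_validation calls above map onto sklearn.model_selection (same assumption as before):

from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=3)
scores = cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=kf)
print(scores.mean())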

Accuracy improves slightly. Time to get creative and add some new features, such as: family size = SibSp + Parch, the length of the name, and the title in the name.

# family size is the sum of siblings/spouses and parents/children aboard
titanic['FamilySize'] = titanic['SibSp'] + titanic['Parch']
# total length of the passenger's name
titanic['NameLength'] = titanic['Name'].apply(lambda x: len(x))

Next, use a regular expression to match the title in each passenger's name, encode the values numerically, and introduce the result as a new feature 'Title':

import re

# use a regular expression to extract the title from a passenger's name
def get_title(name):
    title_search = re.search(r'([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ''

# count the passengers grouped by title
titles = titanic['Name'].apply(get_title)
print(pandas.value_counts(titles))

# encode the titles numerically
title_mapping = {"Mr":1,"Miss":2,"Mrs":3,"Master":4,"Dr":5,"Rev":6,"Major":7,"Col":7,"Mlle":8,"Mme":8,"Don":9,"Lady":10,"Countess":10,"Jonkheer":10,"Str":9,"Capt":7,"Ms":2,"Sir":9}
for k, v in title_mapping.items():
    titles[titles == k] = v
print(pandas.value_counts(titles))

# add the converted feature to the dataset as 'Title'
titanic['Title'] = titles
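A quick sanity check of get_title on the first name in the dataset:

print(get_title('Braund, Mr. Owen Harris'))  # prints 'Mr'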

Bring in scikit-learn's feature selection module and score each candidate feature to determine which ones have the greatest impact:

# import the feature selection module
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
import matplotlib.pyplot as plt

predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","FamilySize","Title","NameLength"]
# score each feature against the label with the ANOVA F-test
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])
scores = -np.log10(selector.pvalues_)
# plot the scores as a bar chart
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()
# take the most influential features as the new feature set
predictors = ["Pclass","Sex","NameLength","Title","Fare"]
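If no plotting backend is available, the same scores can be read off numerically; a small supplementary sketch, where all_predictors repeats the full feature list scored above:

all_predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","FamilySize","Title","NameLength"]
for name, score in sorted(zip(all_predictors, scores), key=lambda pair: pair[1], reverse=True):
    print(name, score)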

The effort was not wasted: the new features really do have a large impact. Take the five most influential features as the new feature set, use the random forest model again, and adjust the parameters:

from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

# feature set
predictors = ["Pclass","Sex","NameLength","Title","Fare"]
# build a random forest with 50 trees; stop splitting at min_samples_split=4 or min_samples_leaf=10
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=10)
# cross-validate once more
kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
# evaluate the model: random forest classifier, passenger features as data, survival as target, folds from kf
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=kf)
print(scores.mean())

0.8159371492704826

Accuracy improves again. Finally, use scikit-learn's model combination: ensemble a gradient boosting classifier with a logistic regression and average their predictions:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold
import numpy as np

# each algorithm is paired with its feature set
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3),
     ["Pclass","Sex","NameLength","Title","Fare"]],
    [LogisticRegression(random_state=1),
     ["Pclass","Sex","NameLength","Title","Fare"]]
]

kf = KFold(titanic.shape[0], n_folds=3, random_state=1)
predictions = []
for train, test in kf:
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    for alg, predictors in algorithms:
        alg.fit(titanic[predictors].iloc[train,:], train_target)
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test,:].astype(float))[:,1]
        full_test_predictions.append(test_predictions)
    # average the two models' probabilities, then binarize at 0.5
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)

predictions = np.concatenate(predictions, axis=0)
accuracy = sum(predictions == titanic["Survived"]) / len(predictions)
print(accuracy)

0.821548821549
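On a modern scikit-learn, the manual probability averaging above corresponds to soft voting; a minimal sketch assuming sklearn.model_selection is available:

from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

predictors = ["Pclass","Sex","NameLength","Title","Fare"]
voter = VotingClassifier(
    estimators=[('gbc', GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3)),
                ('lr', LogisticRegression(random_state=1))],
    voting='soft')  # 'soft' averages predict_proba, matching the manual loop
scores = cross_val_score(voter, titanic[predictors].astype(float), titanic['Survived'], cv=KFold(n_splits=3))
print(scores.mean())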


The final ensemble gives the highest accuracy of all the models tried.
