Getting started with Kaggle and the Titanic case study

This article is about 20,000 characters long, contains roughly 300 lines of Python code, and takes about 30 minutes to read.

1. Preparation

1.1 Create an account

Register an account on the Kaggle website: https://www.kaggle.com/.

1.2 Introductory courses

Before starting, it is worth looking at the quick data-science courses the site offers.
They cover Python basics, introductory machine learning, data visualization, the pandas library, feature engineering, deep learning, SQL, and so on. All of these tools are used later, so if any are unfamiliar, working through the basic courses first is recommended.
Once the basics are done, you can try the Titanic case, which is the standard beginner competition.

1.3 Installing Jupyter

VS Code or PyCharm also work, but one drawback is that intermediate variables cannot be inspected while the program runs. Jupyter and Spyder are more like MATLAB: after a block of code finishes, its variables remain in memory, something that in VS Code or PyCharm requires setting breakpoints.
Another advantage of Jupyter is that Markdown content can be interleaved with the code, which allows for much richer notes.
Installing Jupyter on Ubuntu

sudo pip3 install jupyter

Installing Jupyter on Windows + VS Code

"C:/Program Files (x86)/Microsoft Visual Studio/Shared/Python37_64/python.exe" -m pip install -U jupyter --user

Replace the python.exe path with your own installation path.
If you use Anaconda, Jupyter can also be installed directly from its home screen.

2. Preparing the development environment and data

Generally speaking, a simple machine learning case needs the following libraries:

  • numpy, scipy: basic numerical computation
  • pandas: data analysis, data cleaning, data integration, and handling of large and time-series datasets
  • matplotlib, seaborn: data visualization
  • sklearn: a broad set of machine learning tools, including classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing
  • tensorflow, pytorch, keras: deep learning frameworks

Install these Python packages.
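
For example, with pip (the package names below are the usual PyPI names; scikit-learn is the package that provides the sklearn module, and the deep learning frameworks can be added the same way if needed):

pip3 install numpy scipy pandas matplotlib seaborn scikit-learn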

You also need the CSV data for the Titanic case; the download link can be found by following the instructions on the competition page.
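
Alternatively, if the official Kaggle command-line tool is installed and an API token is configured, the same files can be fetched from a terminal (unzip them into a directory of your choice afterwards):

kaggle competitions download -c titanic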

3. My implementation

To start, here is a simple implementation of my own.

3.1 Code

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb

data = pd.read_csv("data/titan/train.csv")
# print(data.head())
# use only 3 columns for fitting; copy() avoids SettingWithCopyWarning
X = data[['Pclass', 'Age', 'Sex']].copy()
Y = data['Survived']

# fill missing ages with the mean
X['Age'].fillna(X['Age'].mean(), inplace=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 33)

# encode the Sex column with LabelEncoder
label_encoder = LabelEncoder()
X_train['Sex'] = label_encoder.fit_transform(X_train['Sex'])
# the validation split must be encoded as well, or LightGBM will reject the string column
X_test['Sex'] = label_encoder.transform(X_test['Sex'])


# train a LightGBM binary classifier
dtrain = lgb.Dataset(X_train, Y_train)
dvalid = lgb.Dataset(X_test, Y_test)

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
# note: LightGBM >= 4.0 removed these two arguments; use the callbacks
# lgb.early_stopping(10) and lgb.log_evaluation(0) instead
bst = lgb.train(param, 
                dtrain, 
                num_round, 
                valid_sets=[dvalid], 
                early_stopping_rounds=10, 
                verbose_eval=False)

# Making predictions & evaluating the model
from sklearn import metrics
ypred = bst.predict(X_test)
score = metrics.roc_auc_score(Y_test, ypred)
print(f"Test AUC score: {score}")

# apply the same preprocessing to test.csv
newdata = pd.read_csv("data/titan/test.csv")
# print(newdata.head())
newX = newdata[['Pclass', 'Age', 'Sex']].copy()

# fill missing ages (strictly, the training-set mean would be more principled)
newX['Age'].fillna(newX['Age'].mean(), inplace=True)
# encode Sex with the encoder fitted on the training data
newX['Sex'] = label_encoder.transform(newX['Sex'])
# predict and threshold the probabilities at 0.5
ypred = bst.predict(newX)
yfinal = [int(i>0.5) for i in ypred]

# write the results to a file
savetemp = {'Survived':list(yfinal)}
save = pd.DataFrame(savetemp).join(newdata['PassengerId'])
save = save[['PassengerId','Survived']]  # put PassengerId first
file_name = 'data/titan/test1.csv'
save.to_csv(file_name,index = False)

3.2 Approach

  • Split train.csv into train and test parts for model training and evaluation.
  • Use only the 3 fields 'Pclass', 'Age', 'Sex' for prediction.
  • Fill missing values with the mean.
  • Encode the categorical field 'Sex' with LabelEncoder.

The final submission scored about 74%.

4. A better implementation (translated)

Later I found a beginner-friendly example on the site that uses essentially everything the introductory courses cover:
https://www.kaggle.com/loyashoib/beginners-notebook-to-achieve-80-accuracy

What follows is a translation of that tutorial.

4.1 Setup

4.1.1 Importing libraries

#Machine Learning Packages
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn import model_selection

#Data Processing Packages
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection

#Data Visualization Packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Metrics
from sklearn import metrics

#Data Analysis Packages
import pandas as pd
import numpy as np

#Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

4.1.2 Loading the data

#Loading the data sets
gender_submission = pd.read_csv("../input/titanic/gender_submission.csv")
test = pd.read_csv("../input/titanic/test.csv")
train = pd.read_csv("../input/titanic/train.csv")
target = train["Survived"]
data = pd.concat([train.drop("Survived", axis=1),test], axis=0).reset_index(drop=True)

Here the author concatenates train and test. This makes it easy to inspect the distribution of the full dataset, and since the fields are correlated with one another, the full dataset can be exploited later when imputing missing values.

Next, inspect the data: how many rows it has, which fields it contains, each field's type, and whether there are missing values.

data.head() # show the first rows of the data (5 by default)
data.info() # show non-null counts, total rows, and the dtype of each column
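
To count the missing values per column directly, a one-line check on the same frame:

data.isnull().sum() # number of nulls in each column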

4.2 Handling missing data

4.2.1 Age

First, analyze how Age relates to each of the other fields.

#Plotting the relations between Age and other features (Mean).
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(12,12))

#Age vs Pclass
sns.barplot(x="Pclass", y="Age", data=data, ax=ax[0,0])
#Age vs Sex
sns.barplot(x="Sex", y="Age", data=data, ax=ax[0,1])
#Age vs SibSp
sns.barplot(x="SibSp", y="Age", data=data, ax=ax[0,2])
#Age vs Parch
sns.barplot(x="Parch", y="Age", data=data, ax=ax[1,0])
#Age vs Family_size
sns.barplot(x=(data["Parch"] + data["SibSp"]), y="Age", data=data, ax=ax[1,1])
ax[1,1].set(xlabel='Family Size')
#Age vs Embarked
sns.barplot(x="Embarked", y="Age", data=data, ax=ax[1,2])

(figure: bar plots of mean Age vs. Pclass, Sex, SibSp, Parch, Family Size, Embarked)

#Plotting relations between Age and other features (Median).

fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(12,12))
#Age vs Pclass
sns.boxplot(x="Pclass", y="Age", data=data, ax=ax[0,0])
#Age vs Sex
sns.boxplot(x="Sex", y="Age", data=data, ax=ax[0,1])
#Age vs SibSp
sns.boxplot(x="SibSp", y="Age", data=data, ax=ax[0,2])
#Age vs Parch
sns.boxplot(x="Parch", y="Age", data=data, ax=ax[1,0])
#Age vs Family_size
sns.boxplot(x=(data["Parch"] + data["SibSp"]), y="Age", data=data, ax=ax[1,1])
ax[1,1].set(xlabel='Family Size')
#Age vs Embarked
sns.boxplot(x="Embarked", y="Age", data=data, ax=ax[1,2])

(figure: box plots of Age vs. Pclass, Sex, SibSp, Parch, Family Size, Embarked)

The plots show that the higher the passenger class, the higher the average age.
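
As a quick numeric check of this trend (a sketch on the same combined frame; NaN ages are skipped by mean()):

data.groupby("Pclass")["Age"].mean() # mean age per class, highest in 1st class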

This tutorial imputes Age from the three features Pclass, Family Size, and Embarked: for each row with a missing age, take the median age of all rows sharing the same Pclass, Family Size, and Embarked; if no such rows exist, fall back to the median of all ages.

# First, create the Family_Size feature
data["Family_Size"] = data["SibSp"] + data["Parch"]

#Filling in the missing Age values
missing_age_value = data[data["Age"].isnull()]
for index, row in missing_age_value.iterrows():
    median = data["Age"][(data["Pclass"] == row["Pclass"]) & (data["Embarked"] == row["Embarked"]) & (data["Family_Size"] == row["Family_Size"])].median()
    if not np.isnan(median):
        data.loc[index, "Age"] = median  # .loc instead of chained indexing, which may silently fail
    else:
        data.loc[index, "Age"] = np.median(data["Age"].dropna())

4.2.2 Fare

Plot the relationship between fare and passenger class.

#Relation Between Fare and Pclass
fig, ax = plt.subplots(figsize=(7,5))
sns.boxplot(x="Pclass", y="Fare", data=data)

(figure: box plot of Fare by Pclass)

Clearly, the higher the class, the higher the fare, so the missing value is filled with the median fare of the same class.

#Since we have only 1 missing value for Fare we can just fill it according to Pclass feature
print("Pclass of the data point with missing Fare value:", int(data[data["Fare"].isnull()]["Pclass"]))
median = data[data["Pclass"] == 3]["Fare"].median()
data["Fare"].fillna(median, inplace=True)

4.2.3 Cabin

First, look at what values the Cabin field takes.

data.Cabin.unique()

For non-null values, keep only the first letter; for null values, use 'X'.

# replace each Cabin value with its first letter; replace nulls with 'X'
for index, rows in data.iterrows():
    if pd.isnull(rows["Cabin"]):
        data.loc[index, "Cabin"] = 'X'
    else:
        data.loc[index, "Cabin"] = str(rows["Cabin"])[0]

4.2.4 Port of embarkation

Only two values are missing, so the mode is used to fill them.

#Since we only have 2 missing Embarked values we will just fill them with the mode
data["Embarked"].fillna(data["Embarked"].mode()[0], inplace=True)

4.3 Feature engineering

After cleaning the data, we need to decide which features matter for prediction and convert them into a form that a machine learning model can understand.

4.3.1 Survival vs. passenger class and sex

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12,5))

#Survived vs Pclass
sns.barplot(x="Pclass", y=target, data=data[:891], ax=ax[0])

#Survived vs Sex
sns.barplot(x="Sex", y=target, data=data[:891], ax=ax[1])

(figure: survival rate by Pclass and by Sex)

The plots show that the higher the class, the higher the survival rate, and that women survived at a far higher rate than men.
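
The same rates can be read off numerically; a small sketch that temporarily attaches the target back onto the 891 training rows (train_part is just a scratch name):

# survival rate by class and by sex on the training portion
train_part = data[:891].copy()
train_part["Survived"] = target.values
print(train_part.groupby("Pclass")["Survived"].mean())
print(train_part.groupby("Sex")["Survived"].mean())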

4.3.2 Survival vs. Family_Size

Family_Size is not a field in the raw data; it is obtained by adding the parents/children (Parch) and spouses/siblings (SibSp) counts.

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14,5))

#Survived vs Family_Size
sns.barplot(x="Family_Size", y=target, data=data[:891], ax=ax[0])

#Sex vs Single Passengers
sns.countplot(x="Sex", data=data[data["Family_Size"] == 0], ax=ax[1])

(figure: survival rate by Family_Size; count of single passengers by sex)

From the plots:
1. The left plot shows that passengers travelling alone had a lower chance of survival; the author says this is verified by the right plot, since there are more single male passengers than single female passengers.
(Original: "Singles have a lower chance of survival. This can be verified by the fact that there are more 'male single' passengers than 'female single' passengers")
(Translator's note: I believe this is a slip by the author. What the right plot actually shows is that single men and women together number nearly 800 out of roughly 1300 passengers in total, which is a better argument that travelling alone lowered the chance of survival; see the check below.)
2. Families of size 1, 2, or 3 were more likely to survive, consistent with women and children being evacuated first.
3. For family sizes of 4 or more, the survival rate drops again; the author does not know why.
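
As a check on the counts discussed above (run before the regrouping below, while Family_Size still holds raw sizes):

# passengers travelling alone, and their breakdown by sex
singles = data[data["Family_Size"] == 0]
print(len(singles), "of", len(data), "passengers travel alone")
print(singles["Sex"].value_counts())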

Based on this analysis, family size can be divided into 3 groups: travelling alone, family size 1 to 3, and family size 4 or more.

#Dividing Family_Size into 3 groups
data["Family_Size"] = data["Family_Size"].map({0:0, 1:1, 2:1, 3:1, 4:2, 5:2, 6:2, 7:2, 8:2, 9:2, 10:2})

4.3.3 Embarkation port and cabin

plt.figure(figsize=(12,7))

#Survived vs Cabin
plt.subplot(121)
sns.barplot(x="Cabin", y=target, data=data[:891])

#Survived vs Embarked
plt.subplot(122)
sns.barplot(x="Embarked", y=target, data=data[:891])

(figure: survival rate by Cabin letter and by Embarked)

The plots show that both the embarkation port and the cabin affect survival. Since missing cabins were filled with 'X' earlier, passengers with missing cabin information had a lower chance of survival.

4.3.4 The Fare feature

plt.figure(figsize=(12,7))

#Plotting Kde for Fare
plt.subplot(121)
sns.kdeplot(data["Fare"])

#Plotting Kde for Fare with Survived as hue
plt.subplot(122)
sns.kdeplot(np.log(data[:891][target == 1]["Fare"]), color='blue', shade=True)
sns.kdeplot(np.log(data[:891][target == 0]["Fare"]), color='red', shade=True)
plt.legend(["Survived", "Not Survived"])

#Since skewness can result in false conclusions we reduce skew for fare by taking log.
#(note: zero fares become -inf under log; np.log1p would avoid this)
data["Fare"] = np.log(data["Fare"])

#Dividing Fare into different categories
data["Fare"] = pd.qcut(data["Fare"], 5)

The higher the fare, the more likely survival. This corroborates the earlier conclusion that first-class passengers were more likely to survive, so Fare is also an important feature.

4.3.5 The Age feature

Survival rates differ across age groups, so Age can be binned.

label = LabelEncoder()
data["Age"] = label.fit_transform(pd.cut(data["Age"].astype(int), 5))
sns.barplot(x="Age", y=target, data=data[0:891])

(figure: survival rate by Age bin)

4.3.6 The Name feature

Extract the title from each name, which reflects social status.

data["Name"] = data["Name"].apply(lambda x: x.split(",")[1].split(".")[0].strip())

List the unique values to see what titles appear.

data.Name.unique()

Remap these titles according to social rank.

data["Name"] = data["Name"].map({'Mr':1, 'Miss':2, 'Mrs':3, 'Ms':3, 'Mlle':3, 'Mme':3, 'Master':4, 'Dr':5, 'Rev':5, 'Col':5, "Major":5, "Dona":5, "Sir":5, "Lady":5, "Jonkheer":5, "Don":5, "the Countess":5, "Capt":5})
sns.barplot(x="Name", y=target, data=data[0:891])

(figure: survival rate by title)

4.3.7 The Ticket feature

data["Ticket"] = data["Ticket"].apply(lambda x: x.replace(".","").replace('/',"").strip().split(' ')[0] if not x.isdigit() else 'X')
data["Ticket"] = label.fit_transform(data["Ticket"])
sns.barplot(x="Ticket", y=target, data=data[0:891])

(figure: survival rate by ticket prefix)

4.3.8 Encoding categorical features

Drop SibSp and Parch (now summarized by Family_Size) and one-hot encode the categorical fields with the get_dummies method.

data.drop(["SibSp", "Parch"], inplace=True, axis=1)
data = pd.get_dummies(data=data, columns=["Pclass", "Name", "Sex", "Age", "Cabin", "Embarked", "Family_Size", "Ticket", "Fare"], drop_first=True)

Split the data back into train and test.

#Splitting into train and test again
train = data[:891]
test = data[891:]
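
A quick shape check (the standard Titanic split has 891 training rows and 418 test rows):

print(train.shape, test.shape)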

4.4 Machine learning

4.4.1 Code

Test several different models, using a shuffled cross-validation split.

cv_split = model_selection.ShuffleSplit(n_splits=10, test_size=.3, train_size=.7, random_state=42)

Run each of these classifiers in turn:

classifiers = [
    SVC(random_state=42),
    DecisionTreeClassifier(random_state=42),
    AdaBoostClassifier(DecisionTreeClassifier(random_state=42),random_state=42,learning_rate=0.1),
    RandomForestClassifier(random_state=42),
    ExtraTreesClassifier(random_state=42),
    GradientBoostingClassifier(random_state=42),
    MLPClassifier(random_state=42),
    KNeighborsClassifier(),
    LogisticRegression(random_state=42),
    LinearDiscriminantAnalysis()
]
cv_train_mean = []
cv_test_mean = []
cv_score_time = []
cv_fit_time = []
cv_name = []
predictions = []

for classifier in classifiers :
    cv_results = model_selection.cross_validate(classifier, train.drop(['PassengerId'], axis=1), target, cv=cv_split, return_train_score=True)
    cv_train_mean.append(cv_results['train_score'].mean())
    cv_test_mean.append(cv_results['test_score'].mean())
    cv_score_time.append(cv_results['score_time'].mean())
    cv_fit_time.append(cv_results['fit_time'].mean())
    cv_name.append(str(classifier.__class__.__name__))
    classifier.fit(train.drop(['PassengerId'], axis=1), target)
    predictions.append(classifier.predict(test.drop(['PassengerId'], axis=1)))
    print(str(classifier.__class__.__name__))
    

performance_df = pd.DataFrame({"Algorithm":cv_name, "Train Score":cv_train_mean, "Test Score":cv_test_mean, 'Score Time':cv_score_time, 'Fit Time':cv_fit_time})

4.4.2 Inspecting the results

performance_df

(table: train/test scores and timings for each algorithm)

4.4.3 Plotting the results

#Plotting the performance on test set
sns.barplot(x='Test Score', y='Algorithm', data=performance_df)

(figure: test score by algorithm)

4.4.4 Prediction correlation

Plot how correlated the different algorithms' predictions are; models with less correlated predictions add more diversity to the voting ensemble built later.

#Plotting prediction correlation of the algorithms
sns.heatmap(pd.DataFrame(predictions, index=cv_name).T.corr(), annot=True)

(figure: heatmap of pairwise prediction correlation between models)

4.5 Hyperparameter tuning

4.5.1 Model definitions and parameter search spaces

tuned_clf = {
    'DecisionTreeClassifier':DecisionTreeClassifier(random_state=42),
    'AdaBoostClassifier':AdaBoostClassifier(DecisionTreeClassifier(random_state=42),random_state=42,learning_rate=0.1),
    'RandomForestClassifier':RandomForestClassifier(random_state=42),
    'ExtraTreesClassifier':ExtraTreesClassifier(random_state=42),
    
    'GradientBoostingClassifier':GradientBoostingClassifier(random_state=42),
    'MLPClassifier':MLPClassifier(random_state=42),
    'LogisticRegression':LogisticRegression(random_state=42),
    'LinearDiscriminantAnalysis':LinearDiscriminantAnalysis()
}

#DecisionTreeClassifier
grid = {'criterion': ['gini', 'entropy'], 'splitter': ['best', 'random'], 'max_depth': [2,4,6,8,10,None], 
        'min_samples_split': [2,5,10,.03,.05], 'min_samples_leaf': [1,5,10,.03,.05], 'max_features': [None, 'auto']}

tune_model = model_selection.GridSearchCV(tuned_clf['DecisionTreeClassifier'], param_grid=grid, scoring = 'roc_auc', cv = cv_split, return_train_score=True)
tune_model.fit(train.drop(['PassengerId'], axis=1), target)
print("Best Parameters:")
print(tune_model.best_params_)
tuned_clf['DecisionTreeClassifier'].set_params(**tune_model.best_params_)

#AdaBoostClassifier
grid = {'n_estimators': [10, 50, 100, 300], 'learning_rate': [.01, .03, .05, .1, .25], 'algorithm': ['SAMME', 'SAMME.R'] }

tune_model = model_selection.GridSearchCV(tuned_clf['AdaBoostClassifier'], param_grid=grid, scoring = 'roc_auc', cv = cv_split, return_train_score=True)
tune_model.fit(train.drop(['PassengerId'], axis=1), target)
print("Best Parameters:")
print(tune_model.best_params_)
tuned_clf['AdaBoostClassifier'].set_params(**tune_model.best_params_)

#RandomForestClassifier
grid = {'n_estimators': [10, 50, 100, 300], 'criterion': ['gini', 'entropy'], 'max_depth': [2, 4, 6, 8, 10, None], 
        'oob_score': [True] }
 
tune_model = model_selection.GridSearchCV(tuned_clf['RandomForestClassifier'], param_grid=grid, scoring = 'roc_auc', cv = cv_split, return_train_score=True)
tune_model.fit(train.drop(['PassengerId'], axis=1), target)
print("Best Parameters:")
print(tune_model.best_params_)
tuned_clf['RandomForestClassifier'].set_params(**tune_model.best_params_)

#ExtraTreesClassifier
grid = {'n_estimators': [10, 50, 100, 300], 'criterion': ['gini', 'entropy'], 'max_depth': [2, 4, 6, 8, 10, None]}
 
tune_model = model_selection.GridSearchCV(tuned_clf['ExtraTreesClassifier'], param_grid=grid, scoring = 'roc_auc', cv = cv_split, return_train_score=True)
tune_model.fit(train.drop(['PassengerId'], axis=1), target)
print("Best Parameters:")
print(tune_model.best_params_)
tuned_clf['ExtraTreesClassifier'].set_params(**tune_model.best_params_)

#GradientBoostingClassifier
grid = {#'loss': ['deviance', 'exponential'], 'learning_rate': [.01, .03, .05, .1, .25], 
        'n_estimators': [300],
        #'criterion': ['friedman_mse', 'mse', 'mae'], 
        'max_depth': [4] }

tune_model = model_selection.GridSearchCV(tuned_clf['GradientBoostingClassifier'], param_grid=grid, scoring = 'roc_auc', cv = cv_split, return_train_score=True)
tune_model.fit(train.drop(['PassengerId'], axis=1), target)
print("Best Parameters:")
print(tune_model.best_params_)
tuned_clf['GradientBoostingClassifier'].set_params(**tune_model.best_params_)

#MLPClassifier
grid = {'learning_rate': ["constant", "invscaling", "adaptive"], 'alpha': 10.0 ** -np.arange(1, 7), 'activation': ["logistic", "relu", "tanh"]}

tune_model = model_selection.GridSearchCV(tuned_clf['MLPClassifier'], param_grid=grid, scoring = 'roc_auc', cv = cv_split, return_train_score=True)
tune_model.fit(train.drop(['PassengerId'], axis=1), target)
print("Best Parameters:")
print(tune_model.best_params_)
tuned_clf['MLPClassifier'].set_params(**tune_model.best_params_)

#LogisticRegression
grid = {'fit_intercept': [True, False], 'penalty': ['l2'], 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
 
tune_model = model_selection.GridSearchCV(tuned_clf['LogisticRegression'], param_grid=grid, scoring = 'roc_auc', cv = cv_split, return_train_score=True)
tune_model.fit(train.drop(['PassengerId'], axis=1), target)
print("Best Parameters:")
print(tune_model.best_params_)
tuned_clf['LogisticRegression'].set_params(**tune_model.best_params_)

#LinearDiscriminantAnalysis
grid = {"solver" : ["svd"], "tol" : [0.0001,0.0002,0.0003]}
 
tune_model = model_selection.GridSearchCV(tuned_clf['LinearDiscriminantAnalysis'], param_grid=grid, scoring = 'roc_auc', cv = cv_split, return_train_score=True)
tune_model.fit(train.drop(['PassengerId'], axis=1), target)
print("Best Parameters:")
print(tune_model.best_params_)
tuned_clf['LinearDiscriminantAnalysis'].set_params(**tune_model.best_params_)

4.5.2 Evaluating each tuned model

#Evaluating the performance of our tuned models
cv_train_mean = []
cv_test_mean = []
cv_score_time = []
cv_fit_time = []
cv_name = []
predictions = []

for _, classifier in tuned_clf.items():
    cv_results = model_selection.cross_validate(classifier, train.drop(['PassengerId'], axis=1), target, cv=cv_split, return_train_score=True)
    cv_train_mean.append(cv_results['train_score'].mean())
    cv_test_mean.append(cv_results['test_score'].mean())
    cv_score_time.append(cv_results['score_time'].mean())
    cv_fit_time.append(cv_results['fit_time'].mean())
    cv_name.append(str(classifier.__class__.__name__))
    classifier.fit(train.drop(['PassengerId'], axis=1), target)
    predictions.append(classifier.predict(test.drop(['PassengerId'], axis=1)))
    

performance_df = pd.DataFrame({"Algorithm":cv_name, "Train Score":cv_train_mean, "Test Score":cv_test_mean, 'Score Time':cv_score_time, 'Fit Time':cv_fit_time})
performance_df

4.5.3 Final prediction with a voting classifier

With voting='soft', the ensemble averages the class probabilities predicted by all the tuned models and predicts the class with the higher average probability.

# voting classifier
voting_3 = VotingClassifier(estimators=[
    ('DecisionTreeClassifier', tuned_clf['DecisionTreeClassifier']), 
    ('AdaBoostClassifier', tuned_clf['AdaBoostClassifier']),
    ('RandomForestClassifier', tuned_clf['RandomForestClassifier']), 
    ('ExtraTreesClassifier',tuned_clf['ExtraTreesClassifier']),
    ('GradientBoostingClassifier',tuned_clf['GradientBoostingClassifier']), 
    ('MLPClassifier',tuned_clf['MLPClassifier']), 
    ('LogisticRegression', tuned_clf['LogisticRegression']), 
    ('LinearDiscriminantAnalysis', tuned_clf['LinearDiscriminantAnalysis'])], voting='soft', n_jobs=4)

voting_3 = voting_3.fit(train.drop(['PassengerId'], axis=1), target)
pred_3 = voting_3.predict(test.drop(['PassengerId'], axis=1))
sol3 = pd.DataFrame(data=pred_3, columns=['Survived'], index=test['PassengerId'])
sol3.to_csv("data/titan/sol3.csv")

Reposted from blog.csdn.net/iamqianrenzhan/article/details/103040375