Ensemble Algorithms: Random Forest

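Ensemble algorithms
Ensemble learning aims to make machine learning work better: if one model cannot do the job, use several.
Categories
Boosting: start from a weak learner and strengthen it step by step, training with weighted samples.
Bagging: train multiple classifiers and average their results.
For example, if a single decision tree cannot meet the requirement, train 100 decision trees and average them.
The most typical example is the random forest (a set of classifiers trained in parallel).
Stacking: aggregate multiple classification or regression models. All kinds of classifiers (KNN, SVM, etc.) can be stacked. The first stage obtains each model's own result, and the second stage trains on the results of the first stage.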
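As a rough illustration of the three families, here is a minimal sketch using scikit-learn's AdaBoostClassifier, BaggingClassifier and StackingClassifier. The synthetic dataset, the choice of base models and all parameter values are assumptions made for the example, not something taken from the notes above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic data, used only so the example is self-contained.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    # Boosting: weak learners trained one after another on reweighted samples.
    "boosting": AdaBoostClassifier(n_estimators=50, random_state=0),
    # Bagging: many decision trees trained independently on bootstrap samples.
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    # Stacking: stage one is KNN and SVM, stage two learns from their outputs.
    "stacking": StackingClassifier(
        estimators=[("knn", KNeighborsClassifier()), ("svm", SVC())],
        final_estimator=LogisticRegression(),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out part

print(scores)
```

The scores depend on the synthetic data, so they say nothing about which family is better in general.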
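Random forest
Random: the data sampling is random and the feature selection is random (otherwise all 100 trees would give exactly the same prediction, which would be pointless). Forest: many decision trees placed side by side. Remember that they are parallel: the data goes into every tree (say, three trees) independently.
For example, when a random forest of 100 trees is used for regression, the final result is the average of the trees' outputs. When it is used for classification and 90 trees answer A while 10 answer B, the final result is A.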
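The two ways of combining tree outputs can be sketched in a few lines. The per-tree outputs below are made-up numbers for a single sample, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical outputs of individual trees for one sample.
regression_outputs = np.array([3.0, 2.5, 3.5, 3.0])   # one value per tree
class_votes = np.array(["A"] * 90 + ["B"] * 10)       # one label per tree

# Regression: average the tree outputs.
regression_prediction = regression_outputs.mean()

# Classification: take the label with the most votes.
labels, counts = np.unique(class_votes, return_counts=True)
classification_prediction = labels[np.argmax(counts)]

print(regression_prediction)      # 3.0
print(classification_prediction)  # A
```

Note that scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes; the plain majority vote here follows the description above.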
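Random sampling in a random forest: suppose there are 100 samples and three trees to build. The first tree draws 60 samples, is built from them, and the samples are put back. The second tree also draws 60 and puts them back. This is sampling with replacement, so the three trees all end up different.
The second layer of randomness is in the features: with 10 features, for example, each tree randomly picks only some of them. This keeps the trees diverse. Every tree, however, is given the same number of samples and the same number of features.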
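A minimal sketch of the two layers of randomness, using the numbers from the text (100 samples, 60 rows per tree, 10 features, three trees). The choice of 4 features per tree is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 10
n_trees, rows_per_tree, features_per_tree = 3, 60, 4  # 4 is an illustrative choice

plans = []
for _ in range(n_trees):
    # Rows are drawn WITH replacement, so a row may appear more than once.
    rows = rng.choice(n_samples, size=rows_per_tree, replace=True)
    # Features are drawn without replacement: each tree sees a distinct subset.
    cols = rng.choice(n_features, size=features_per_tree, replace=False)
    plans.append((rows, cols))

for i, (rows, cols) in enumerate(plans):
    print(f"tree {i}: {len(set(rows.tolist()))} distinct rows, features {sorted(cols.tolist())}")
```

In scikit-learn the details differ slightly: by default each bootstrap sample has as many rows as the training set, and the random feature subset (max_features) is redrawn at every split rather than once per tree.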
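After training, a random forest can report which features are more important.
How is this judged? Take four features A, B, C, D. Build a model and compute its error rate, error1. Then take B (say the column is age) and turn it into garbage data, for example by shuffling its values. Rebuild the model and compute the error rate again, error2. If the two error rates are about the same, B makes no difference and is unimportant, so that column can be kept or dropped. If error2 is clearly higher, B is important.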
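The error1 / error2 comparison can be sketched as follows. The data is synthetic (columns A and B are informative, C and D are noise by construction), and the helper error_rate is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# With shuffle=False the first two columns are informative, the last two are noise.
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
feature_names = ["A", "B", "C", "D"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rng = np.random.default_rng(0)

def error_rate(X_tr, X_te):
    """Fit a forest on X_tr and return its error rate on X_te."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_tr, y_train)
    return 1.0 - model.score(X_te, y_test)

error1 = error_rate(X_train, X_test)  # baseline with all columns intact

importance = {}
for j, name in enumerate(feature_names):
    X_tr, X_te = X_train.copy(), X_test.copy()
    # Turn column j into garbage by shuffling its values, then rebuild the model.
    X_tr[:, j] = rng.permutation(X_tr[:, j])
    X_te[:, j] = rng.permutation(X_te[:, j])
    error2 = error_rate(X_tr, X_te)
    importance[name] = error2 - error1  # a large increase suggests an important column

print(importance)
```

scikit-learn also ships sklearn.inspection.permutation_importance, which shuffles a column on held-out data without refitting the model; the refit variant above simply follows the description in the text.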
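In theory, the more trees a random forest has, the better it performs. In practice, beyond a certain number of trees the accuracy stops improving noticeably as more are added.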
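One way to see this plateau is to train forests of increasing size and compare their test accuracy. This is only a sketch on synthetic data; the tree counts and dataset size are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

accuracy_by_trees = {}
for n in (1, 5, 20, 50, 100, 200):
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    forest.fit(X_train, y_train)
    accuracy_by_trees[n] = forest.score(X_test, y_test)

for n, acc in accuracy_by_trees.items():
    print(f"{n:>3} trees: accuracy {acc:.3f}")
```

Where the curve flattens depends on the data, so the useful range of n_estimators has to be found by experiment.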

The Titanic passenger survival data is used below to demonstrate a random forest:

import pandas as pd
import numpy as np
titanic = pd.read_csv("titanic_train.csv")
%matplotlib inline
titanic.head(5)

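Data preprocessing is very important, because most machine learning comes down to matrix arithmetic. The data must not contain NaN or other empty values, so preprocessing is essential.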

# Fill missing Age values with the median
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
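What the median fill does can be checked on a tiny made-up series (the values below are not from the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy ages with two missing entries.
ages = pd.Series([22.0, np.nan, 30.0, 40.0, np.nan])

# median() ignores NaN by default, so it is computed from 22, 30 and 40.
filled = ages.fillna(ages.median())

print(filled.tolist())  # [22.0, 30.0, 30.0, 40.0, 30.0]
```

The median is less sensitive to a few extreme ages than the mean, which is one reason to prefer it here.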

Text cannot take part in matrix arithmetic either, so string columns have to be converted to numbers first.

# Convert sex to a number
# male is 0, female is 1
titanic.loc[titanic["Sex"] == "male","Sex"] = 0
titanic.loc[titanic["Sex"] == "female","Sex"] = 1
titanic.head()

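# Convert the port of embarkation to numbers as well, otherwise it cannot take part in the computation
# Fill missing values with S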
titanic["Embarked"] = titanic["Embarked"].fillna("S")
titanic.loc[titanic["Embarked"] == "S","Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C","Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q","Embarked"] = 2
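The same encoding can also be written with Series.map, which is an alternative to the row-by-row .loc assignments. This is a sketch on a toy frame (the three rows are invented), using the same codes as above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", None, "Q"],   # one missing port
})

# One dictionary lookup per column instead of one .loc assignment per category.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Embarked"] = toy["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})

print(toy)
print(toy.dtypes)
```

With map the resulting columns have an integer dtype, whereas assigning numbers into a string column through .loc leaves it as object. Any value missing from the dictionary becomes NaN, so the mapping has to cover every category.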

Create the training and test sets

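from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
# RandomForestRegressor could be imported here as well: a random forest can be used for both classification and regression
predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]
# Features to train on. Which features actually matter is examined later; for now use these for the experiment
x = titanic[predictors]
y = titanic["Survived"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)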

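from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
# The best approach is to pick a set of candidate values and try every combination to see which is best; GridSearchCV does this for us
tree_param_grid = {'min_samples_split': [3, 6, 9, 12], 'max_depth': [7, 9, 10, 11], 'n_estimators': [30, 40, 50, 100], 'min_samples_leaf': [1, 3, 5]}  # the parameters to loop over, i.e. the values to test
grid = GridSearchCV(RandomForestClassifier(), param_grid=tree_param_grid, cv=5)  # choose the model and cv, the number of cross-validation folds
grid.fit(x_train, y_train)
# grid_scores_ (which printed the listing below) was also removed in 0.20; cv_results_ holds the same per-combination scores
grid.cv_results_, grid.best_params_, grid.best_score_
# Cross-validation checks whether the chosen parameters are reliable. The training data is split into k parts.
# With three parts 1, 2, 3: build the model on 1 and 2 and validate on 3, then build on 2 and 3 and validate on 1,
# and so on for three rounds, averaging the scores. cv=5 above does the same with five parts.
# This evens out the luck of a single split: a validation set made up only of easy samples would make the model look better (or worse) than it is.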

The result is as follows:

([mean: 0.82047, std: 0.03928, params: {'max_depth': 7, 'n_estimators': 30, 'min_samples_split': 3, 'min_samples_leaf': 1},
  mean: 0.81879, std: 0.03195, params: {'max_depth': 7, 'n_estimators': 40, 'min_samples_split': 3, 'min_samples_leaf': 1},
  mean: 0.81711, std: 0.03252, params: {'max_depth': 7, 'n_estimators': 50, 'min_samples_split': 3, 'min_samples_leaf': 1},
  mean: 0.81544, std: 0.02031, params: {'max_depth': 7, 'n_estimators': 100, 'min_samples_split': 3, 'min_samples_leaf': 1},
  mean: 0.80705, std: 0.03967, params: {'max_depth': 7, 'n_estimators': 30, 'min_samples_split': 6, 'min_samples_leaf': 1},
  mean: 0.82047, std: 0.01782, params: {'max_depth': 7, 'n_estimators': 40, 'min_samples_split': 6, 'min_samples_leaf': 1},
  mean: 0.82383, std: 0.04042, params: {'max_depth': 7, 'n_estimators': 50, 'min_samples_split': 6, 'min_samples_leaf': 1},
  mean: 0.81376, std: 0.02847, params: {'max_depth': 7, 'n_estimators': 100, 'min_samples_split': 6, 'min_samples_leaf': 1},
  mean: 0.81711, std: 0.03873, params: {'max_depth': 7, 'n_estimators': 30, 'min_samples_split': 9, 'min_samples_leaf': 1},
  mean: 0.81879, std: 0.02923, params: {'max_depth': 7, 'n_estimators': 40, 'min_samples_split': 9, 'min_samples_leaf': 1},
  mean: 0.81711, std: 0.04062, params: {'max_depth': 7, 'n_estimators': 50, 'min_samples_split': 9, 'min_samples_leaf': 1},
  mean: 0.82886, std: 0.03527, params: {'max_depth': 7, 'n_estimators': 100, 'min_samples_split': 9, 'min_samples_leaf': 1},
  mean: 0.80201, std: 0.04188, params: {'max_depth': 7, 'n_estimators': 30, 'min_samples_split': 12, 'min_samples_leaf': 1},
  mean: 0.83221, std: 0.03644, params: {'max_depth': 7, 'n_estimators': 40, 'min_samples_split': 12, 'min_samples_leaf': 1},
  mean: 0.82886, std: 0.03957, params: {'max_depth': 7, 'n_estimators': 50, 'min_samples_split': 12, 'min_samples_leaf': 1},
  mean: 0.82215, std: 0.03419, params: {'max_depth': 7, 'n_estimators': 100, 'min_samples_split': 12, 'min_samples_leaf': 1},
  mean: 0.81208, std: 0.03297, params: {'max_depth': 7, 'n_estimators': 30, 'min_samples_split': 3, 'min_samples_leaf': 3},
  mean: 0.81711, std: 0.03735, params: {'max_depth': 7, 'n_estimators': 40, 'min_samples_split': 3, 'min_samples_leaf': 3},
  mean: 0.81544, std: 0.03482, params: {'max_depth': 7, 'n_estimators': 50, 'min_samples_split': 3, 'min_samples_leaf': 3},
  mean: 0.82383, std: 0.04147, params: {'max_depth': 7, 'n_estimators': 100, 'min_samples_split': 3, 'min_samples_leaf': 3},
  mean: 0.80872, std: 0.03383, params: {'max_depth': 7, 'n_estimators': 30, 'min_samples_split': 6, 'min_samples_leaf': 3},
  mean: 0.81879, std: 0.03857, params: {'max_depth': 7, 'n_estimators': 40, 'min_samples_split': 6, 'min_samples_leaf': 3},
  mean: 0.82550, std: 0.03886, params: {'max_depth': 7, 'n_estimators': 50, 'min_samples_split': 6, 'min_samples_leaf': 3},
  mean: 0.81711, std: 0.03705, params: {'max_depth': 7, 'n_estimators': 100, 'min_samples_split': 6, 'min_samples_leaf': 3},
  mean: 0.81208, std: 0.03211, params: {'max_depth': 7, 'n_estimators': 30, 'min_samples_split': 9, 'min_samples_leaf': 3},
  mean: 0.81879, std: 0.03773, params: {'max_depth': 7, 'n_estimators': 40, 'min_samples_split': 9, 'min_samples_leaf': 3},
  mean: 0.82383, std: 0.04470, params: {'max_depth': 7, 'n_estimators': 50, 'min_samples_split': 9, 'min_samples_leaf': 3},
  mean: 0.82215, std: 0.04447, params: {'max_depth': 7, 'n_estimators': 100, 'min_samples_split': 9, 'min_samples_leaf': 3},
  mean: 0.82383, std: 0.02523, params: {'max_depth': 7, 'n_estimators': 30, 'min_samples_split': 12, 'min_samples_leaf': 3},
  mean: 0.82215, std: 0.03883, params: {'max_depth': 7, 'n_estimators': 40, 'min_samples_split': 12, 'min_samples_leaf': 3},
  mean: 0.82047, std: 0.04098, params: {'max_depth': 7, 'n_estimators': 50, 'min_samples_split': 12, 'min_samples_leaf': 3},
  mean: 0.82215, std: 0.03735, params: {'max_depth': 7, 'n_estimators': 100, 'min_samples_split': 12, 'min_samples_leaf': 3},
  mean: 0.81711, std: 0.03042, params: {'max_depth': 7, 'n_estimators': 30, 'min_samples_split': 3, 'min_samples_leaf': 5},
  mean: 0.82550, std: 0.04505, params: {'max_depth': 7, 'n_estimators': 40, 'min_samples_split': 3, 'min_samples_leaf': 5},
  mean: 0.81879, std: 0.04288, params: {'max_depth': 7, 'n_estimators': 50, 'min_samples_split': 3, 'min_samples_leaf': 5},
  mean: 0.81544, std: 0.04111, params: {'max_depth': 7, 'n_estimators': 100, 'min_samples_split': 3, 'min_samples_leaf': 5},
  mean: 0.81208, std: 0.04588, params: {'max_depth': 7, 'n_estimators': 30, 'min_samples_split': 6, 'min_samples_leaf': 5},
  mean: 0.82383, std: 0.04618, params: {'max_depth': 7, 'n_estimators': 40, 'min_samples_split': 6, 'min_samples_leaf': 5},
  mean: 0.80705, std: 0.03000, params: {'max_depth': 7, 'n_estimators': 50, 'min_samples_split': 6, 'min_samples_leaf': 5},
  mean: 0.81544, std: 0.04282, params: {'max_depth': 7, 'n_estimators': 100, 'min_samples_split': 6, 'min_samples_leaf': 5},
  mean: 0.80705, std: 0.04726, params: {'max_depth': 7, 'n_estimators': 30, 'min_samples_split': 9, 'min_samples_leaf': 5},
  mean: 0.81376, std: 0.04281, params: {'max_depth': 7, 'n_estimators': 40, 'min_samples_split': 9, 'min_samples_leaf': 5},
  mean: 0.80201, std: 0.03525, params: {'max_depth': 7, 'n_estimators': 50, 'min_samples_split': 9, 'min_samples_leaf': 5},
  mean: 0.81208, std: 0.04432, params: {'max_depth': 7, 'n_estimators': 100, 'min_samples_split': 9, 'min_samples_leaf': 5},
  mean: 0.81376, std: 0.03894, params: {'max_depth': 7, 'n_estimators': 30, 'min_samples_split': 12, 'min_samples_leaf': 5},
  mean: 0.81711, std: 0.04330, params: {'max_depth': 7, 'n_estimators': 40, 'min_samples_split': 12, 'min_samples_leaf': 5},
  mean: 0.81544, std: 0.04477, params: {'max_depth': 7, 'n_estimators': 50, 'min_samples_split': 12, 'min_samples_leaf': 5},
  mean: 0.81544, std: 0.04275, params: {'max_depth': 7, 'n_estimators': 100, 'min_samples_split': 12, 'min_samples_leaf': 5},
  mean: 0.81376, std: 0.02758, params: {'max_depth': 9, 'n_estimators': 30, 'min_samples_split': 3, 'min_samples_leaf': 1},
  mean: 0.80872, std: 0.01414, params: {'max_depth': 9, 'n_estimators': 40, 'min_samples_split': 3, 'min_samples_leaf': 1},
  mean: 0.81208, std: 0.02809, params: {'max_depth': 9, 'n_estimators': 50, 'min_samples_split': 3, 'min_samples_leaf': 1},
  mean: 0.82383, std: 0.02138, params: {'max_depth': 9, 'n_estimators': 100, 'min_samples_split': 3, 'min_samples_leaf': 1},
  mean: 0.81711, std: 0.03304, params: {'max_depth': 9, 'n_estimators': 30, 'min_samples_split': 6, 'min_samples_leaf': 1},
  mean: 0.81711, std: 0.02604, params: {'max_depth': 9, 'n_estimators': 40, 'min_samples_split': 6, 'min_samples_leaf': 1},
  mean: 0.81879, std: 0.03027, params: {'max_depth': 9, 'n_estimators': 50, 'min_samples_split': 6, 'min_samples_leaf': 1},
  mean: 0.81711, std: 0.02054, params: {'max_depth': 9, 'n_estimators': 100, 'min_samples_split': 6, 'min_samples_leaf': 1},
  mean: 0.83054, std: 0.02275, params: {'max_depth': 9, 'n_estimators': 30, 'min_samples_split': 9, 'min_samples_leaf': 1},
  mean: 0.83054, std: 0.03720, params: {'max_depth': 9, 'n_estimators': 40, 'min_samples_split': 9, 'min_samples_leaf': 1},
  mean: 0.82215, std: 0.03102, params: {'max_depth': 9, 'n_estimators': 50, 'min_samples_split': 9, 'min_samples_leaf': 1},
  mean: 0.82047, std: 0.02823, params: {'max_depth': 9, 'n_estimators': 100, 'min_samples_split': 9, 'min_samples_leaf': 1},
  mean: 0.81544, std: 0.04188, params: {'max_depth': 9, 'n_estimators': 30, 'min_samples_split': 12, 'min_samples_leaf': 1},
  mean: 0.81879, std: 0.03734, params: {'max_depth': 9, 'n_estimators': 40, 'min_samples_split': 12, 'min_samples_leaf': 1},
  mean: 0.82047, std: 0.02721, params: {'max_depth': 9, 'n_estimators': 50, 'min_samples_split': 12, 'min_samples_leaf': 1},
  mean: 0.82215, std: 0.02509, params: {'max_depth': 9, 'n_estimators': 100, 'min_samples_split': 12, 'min_samples_leaf': 1},
  mean: 0.81711, std: 0.03484, params: {'max_depth': 9, 'n_estimators': 30, 'min_samples_split': 3, 'min_samples_leaf': 3},
  mean: 0.80872, std: 0.03810, params: {'max_depth': 9, 'n_estimators': 40, 'min_samples_split': 3, 'min_samples_leaf': 3},
  mean: 0.81879, std: 0.03856, params: {'max_depth': 9, 'n_estimators': 50, 'min_samples_split': 3, 'min_samples_leaf': 3},
  mean: 0.81376, std: 0.02759, params: {'max_depth': 9, 'n_estimators': 100, 'min_samples_split': 3, 'min_samples_leaf': 3},
  mean: 0.81040, std: 0.03472, params: {'max_depth': 9, 'n_estimators': 30, 'min_samples_split': 6, 'min_samples_leaf': 3},
  mean: 0.81711, std: 0.03452, params: {'max_depth': 9, 'n_estimators': 40, 'min_samples_split': 6, 'min_samples_leaf': 3},
  mean: 0.81544, std: 0.02785, params: {'max_depth': 9, 'n_estimators': 50, 'min_samples_split': 6, 'min_samples_leaf': 3},
  mean: 0.81879, std: 0.04522, params: {'max_depth': 9, 'n_estimators': 100, 'min_samples_split': 6, 'min_samples_leaf': 3},
  mean: 0.81544, std: 0.02951, params: {'max_depth': 9, 'n_estimators': 30, 'min_samples_split': 9, 'min_samples_leaf': 3},
  mean: 0.81879, std: 0.03220, params: {'max_depth': 9, 'n_estimators': 40, 'min_samples_split': 9, 'min_samples_leaf': 3},
  mean: 0.81208, std: 0.03139, params: {'max_depth': 9, 'n_estimators': 50, 'min_samples_split': 9, 'min_samples_leaf': 3},
  mean: 0.81544, std: 0.03841, params: {'max_depth': 9, 'n_estimators': 100, 'min_samples_split': 9, 'min_samples_leaf': 3},
  mean: 0.82886, std: 0.03641, params: {'max_depth': 9, 'n_estimators': 30, 'min_samples_split': 12, 'min_samples_leaf': 3},
  mean: 0.81208, std: 0.03001, params: {'max_depth': 9, 'n_estimators': 40, 'min_samples_split': 12, 'min_samples_leaf': 3},
  mean: 0.82047, std: 0.02850, params: {'max_depth': 9, 'n_estimators': 50, 'min_samples_split': 12, 'min_samples_leaf': 3},
  mean: 0.81376, std: 0.03083, params: {'max_depth': 9, 'n_estimators': 100, 'min_samples_split': 12, 'min_samples_leaf': 3},
  mean: 0.81544, std: 0.03063, params: {'max_depth': 9, 'n_estimators': 30, 'min_samples_split': 3, 'min_samples_leaf': 5},
  mean: 0.81879, std: 0.03958, params: {'max_depth': 9, 'n_estimators': 40, 'min_samples_split': 3, 'min_samples_leaf': 5},
  mean: 0.81208, std: 0.03139, params: {'max_depth': 9, 'n_estimators': 50, 'min_samples_split': 3, 'min_samples_leaf': 5},
  mean: 0.80872, std: 0.04542, params: {'max_depth': 9, 'n_estimators': 100, 'min_samples_split': 3, 'min_samples_leaf': 5},
  mean: 0.80537, std: 0.04851, params: {'max_depth': 9, 'n_estimators': 30, 'min_samples_split': 6, 'min_samples_leaf': 5},
  mean: 0.82047, std: 0.03896, params: {'max_depth': 9, 'n_estimators': 40, 'min_samples_split': 6, 'min_samples_leaf': 5},
  mean: 0.81208, std: 0.03620, params: {'max_depth': 9, 'n_estimators': 50, 'min_samples_split': 6, 'min_samples_leaf': 5},
  mean: 0.81879, std: 0.04395, params: {'max_depth': 9, 'n_estimators': 100, 'min_samples_split': 6, 'min_samples_leaf': 5},
  mean: 0.81711, std: 0.04502, params: {'max_depth': 9, 'n_estimators': 30, 'min_samples_split': 9, 'min_samples_leaf': 5},
  mean: 0.81208, std: 0.04458, params: {'max_depth': 9, 'n_estimators': 40, 'min_samples_split': 9, 'min_samples_leaf': 5},
  mean: 0.82215, std: 0.03410, params: {'max_depth': 9, 'n_estimators': 50, 'min_samples_split': 9, 'min_samples_leaf': 5},
  mean: 0.81040, std: 0.03958, params: {'max_depth': 9, 'n_estimators': 100, 'min_samples_split': 9, 'min_samples_leaf': 5},
  mean: 0.81544, std: 0.04882, params: {'max_depth': 9, 'n_estimators': 30, 'min_samples_split': 12, 'min_samples_leaf': 5},
  mean: 0.80537, std: 0.04086, params: {'max_depth': 9, 'n_estimators': 40, 'min_samples_split': 12, 'min_samples_leaf': 5},
  mean: 0.81376, std: 0.03915, params: {'max_depth': 9, 'n_estimators': 50, 'min_samples_split': 12, 'min_samples_leaf': 5},
  mean: 0.81208, std: 0.04194, params: {'max_depth': 9, 'n_estimators': 100, 'min_samples_split': 12, 'min_samples_leaf': 5},
  mean: 0.82383, std: 0.02112, params: {'max_depth': 10, 'n_estimators': 30, 'min_samples_split': 3, 'min_samples_leaf': 1},
  mean: 0.82215, std: 0.02037, params: {'max_depth': 10, 'n_estimators': 40, 'min_samples_split': 3, 'min_samples_leaf': 1},
  mean: 0.81711, std: 0.01668, params: {'max_depth': 10, 'n_estimators': 50, 'min_samples_split': 3, 'min_samples_leaf': 1},
  mean: 0.81711, std: 0.02023, params: {'max_depth': 10, 'n_estimators': 100, 'min_samples_split': 3, 'min_samples_leaf': 1},
  mean: 0.81711, std: 0.02966, params: {'max_depth': 10, 'n_estimators': 30, 'min_samples_split': 6, 'min_samples_leaf': 1},
  mean: 0.82215, std: 0.02481, params: {'max_depth': 10, 'n_estimators': 40, 'min_samples_split': 6, 'min_samples_leaf': 1},
  mean: 0.80537, std: 0.02147, params: {'max_depth': 10, 'n_estimators': 50, 'min_samples_split': 6, 'min_samples_leaf': 1},
  mean: 0.82215, std: 0.02837, params: {'max_depth': 10, 'n_estimators': 100, 'min_samples_split': 6, 'min_samples_leaf': 1},
  mean: 0.81376, std: 0.03023, params: {'max_depth': 10, 'n_estimators': 30, 'min_samples_split': 9, 'min_samples_leaf': 1},
  mean: 0.81711, std: 0.02895, params: {'max_depth': 10, 'n_estimators': 40, 'min_samples_split': 9, 'min_samples_leaf': 1},
  mean: 0.82047, std: 0.03477, params: {'max_depth': 10, 'n_estimators': 50, 'min_samples_split': 9, 'min_samples_leaf': 1},
  mean: 0.82550, std: 0.01932, params: {'max_depth': 10, 'n_estimators': 100, 'min_samples_split': 9, 'min_samples_leaf': 1},
  mean: 0.81711, std: 0.03365, params: {'max_depth': 10, 'n_estimators': 30, 'min_samples_split': 12, 'min_samples_leaf': 1},
  mean: 0.81040, std: 0.03075, params: {'max_depth': 10, 'n_estimators': 40, 'min_samples_split': 12, 'min_samples_leaf': 1},
  mean: 0.82383, std: 0.03598, params: {'max_depth': 10, 'n_estimators': 50, 'min_samples_split': 12, 'min_samples_leaf': 1},
  mean: 0.82215, std: 0.04053, params: {'max_depth': 10, 'n_estimators': 100, 'min_samples_split': 12, 'min_samples_leaf': 1},
  mean: 0.80537, std: 0.03768, params: {'max_depth': 10, 'n_estimators': 30, 'min_samples_split': 3, 'min_samples_leaf': 3},
  mean: 0.81376, std: 0.03675, params: {'max_depth': 10, 'n_estimators': 40, 'min_samples_split': 3, 'min_samples_leaf': 3},
  mean: 0.82383, std: 0.04083, params: {'max_depth': 10, 'n_estimators': 50, 'min_samples_split': 3, 'min_samples_leaf': 3},
  mean: 0.81376, std: 0.03886, params: {'max_depth': 10, 'n_estimators': 100, 'min_samples_split': 3, 'min_samples_leaf': 3},
  mean: 0.80705, std: 0.04205, params: {'max_depth': 10, 'n_estimators': 30, 'min_samples_split': 6, 'min_samples_leaf': 3},
  mean: 0.81879, std: 0.03334, params: {'max_depth': 10, 'n_estimators': 40, 'min_samples_split': 6, 'min_samples_leaf': 3},
  mean: 0.80872, std: 0.04608, params: {'max_depth': 10, 'n_estimators': 50, 'min_samples_split': 6, 'min_samples_leaf': 3},
  mean: 0.81711, std: 0.04602, params: {'max_depth': 10, 'n_estimators': 100, 'min_samples_split': 6, 'min_samples_leaf': 3},
  mean: 0.81544, std: 0.04436, params: {'max_depth': 10, 'n_estimators': 30, 'min_samples_split': 9, 'min_samples_leaf': 3},
  mean: 0.82047, std: 0.04168, params: {'max_depth': 10, 'n_estimators': 40, 'min_samples_split': 9, 'min_samples_leaf': 3},
  mean: 0.82215, std: 0.04163, params: {'max_depth': 10, 'n_estimators': 50, 'min_samples_split': 9, 'min_samples_leaf': 3},
  mean: 0.81544, std: 0.03722, params: {'max_depth': 10, 'n_estimators': 100, 'min_samples_split': 9, 'min_samples_leaf': 3},
  mean: 0.81208, std: 0.04296, params: {'max_depth': 10, 'n_estimators': 30, 'min_samples_split': 12, 'min_samples_leaf': 3},
  mean: 0.80705, std: 0.04342, params: {'max_depth': 10, 'n_estimators': 40, 'min_samples_split': 12, 'min_samples_leaf': 3},
  mean: 0.81040, std: 0.04634, params: {'max_depth': 10, 'n_estimators': 50, 'min_samples_split': 12, 'min_samples_leaf': 3},
  mean: 0.82047, std: 0.05103, params: {'max_depth': 10, 'n_estimators': 100, 'min_samples_split': 12, 'min_samples_leaf': 3},
  mean: 0.81208, std: 0.03698, params: {'max_depth': 10, 'n_estimators': 30, 'min_samples_split': 3, 'min_samples_leaf': 5},
  mean: 0.81040, std: 0.04453, params: {'max_depth': 10, 'n_estimators': 40, 'min_samples_split': 3, 'min_samples_leaf': 5},
  mean: 0.80537, std: 0.03317, params: {'max_depth': 10, 'n_estimators': 50, 'min_samples_split': 3, 'min_samples_leaf': 5},
  mean: 0.81376, std: 0.04610, params: {'max_depth': 10, 'n_estimators': 100, 'min_samples_split': 3, 'min_samples_leaf': 5},
  mean: 0.81879, std: 0.04334, params: {'max_depth': 10, 'n_estimators': 30, 'min_samples_split': 6, 'min_samples_leaf': 5},
  mean: 0.81879, std: 0.04093, params: {'max_depth': 10, 'n_estimators': 40, 'min_samples_split': 6, 'min_samples_leaf': 5},
  mean: 0.81544, std: 0.03734, params: {'max_depth': 10, 'n_estimators': 50, 'min_samples_split': 6, 'min_samples_leaf': 5},
  mean: 0.81711, std: 0.04023, params: {'max_depth': 10, 'n_estimators': 100, 'min_samples_split': 6, 'min_samples_leaf': 5},
  mean: 0.81376, std: 0.04715, params: {'max_depth': 10, 'n_estimators': 30, 'min_samples_split': 9, 'min_samples_leaf': 5},
  mean: 0.81376, std: 0.04857, params: {'max_depth': 10, 'n_estimators': 40, 'min_samples_split': 9, 'min_samples_leaf': 5},
  mean: 0.81376, std: 0.03714, params: {'max_depth': 10, 'n_estimators': 50, 'min_samples_split': 9, 'min_samples_leaf': 5},
  mean: 0.82215, std: 0.04132, params: {'max_depth': 10, 'n_estimators': 100, 'min_samples_split': 9, 'min_samples_leaf': 5},
  mean: 0.80872, std: 0.03463, params: {'max_depth': 10, 'n_estimators': 30, 'min_samples_split': 12, 'min_samples_leaf': 5},
  mean: 0.81376, std: 0.03825, params: {'max_depth': 10, 'n_estimators': 40, 'min_samples_split': 12, 'min_samples_leaf': 5},
  mean: 0.81544, std: 0.04522, params: {'max_depth': 10, 'n_estimators': 50, 'min_samples_split': 12, 'min_samples_leaf': 5},
  mean: 0.81208, std: 0.04145, params: {'max_depth': 10, 'n_estimators': 100, 'min_samples_split': 12, 'min_samples_leaf': 5},
  mean: 0.80201, std: 0.02388, params: {'max_depth': 11, 'n_estimators': 30, 'min_samples_split': 3, 'min_samples_leaf': 1},
  mean: 0.80705, std: 0.02790, params: {'max_depth': 11, 'n_estimators': 40, 'min_samples_split': 3, 'min_samples_leaf': 1},
  mean: 0.81376, std: 0.02411, params: {'max_depth': 11, 'n_estimators': 50, 'min_samples_split': 3, 'min_samples_leaf': 1},
  mean: 0.79530, std: 0.02708, params: {'max_depth': 11, 'n_estimators': 100, 'min_samples_split': 3, 'min_samples_leaf': 1},
  mean: 0.82383, std: 0.02152, params: {'max_depth': 11, 'n_estimators': 30, 'min_samples_split': 6, 'min_samples_leaf': 1},
  mean: 0.81376, std: 0.02458, params: {'max_depth': 11, 'n_estimators': 40, 'min_samples_split': 6, 'min_samples_leaf': 1},
  mean: 0.81879, std: 0.01957, params: {'max_depth': 11, 'n_estimators': 50, 'min_samples_split': 6, 'min_samples_leaf': 1},
  mean: 0.81711, std: 0.01931, params: {'max_depth': 11, 'n_estimators': 100, 'min_samples_split': 6, 'min_samples_leaf': 1},
  mean: 0.81711, std: 0.03027, params: {'max_depth': 11, 'n_estimators': 30, 'min_samples_split': 9, 'min_samples_leaf': 1},
  mean: 0.81040, std: 0.02205, params: {'max_depth': 11, 'n_estimators': 40, 'min_samples_split': 9, 'min_samples_leaf': 1},
  mean: 0.81544, std: 0.03660, params: {'max_depth': 11, 'n_estimators': 50, 'min_samples_split': 9, 'min_samples_leaf': 1},
  mean: 0.81711, std: 0.03318, params: {'max_depth': 11, 'n_estimators': 100, 'min_samples_split': 9, 'min_samples_leaf': 1},
  mean: 0.81879, std: 0.03385, params: {'max_depth': 11, 'n_estimators': 30, 'min_samples_split': 12, 'min_samples_leaf': 1},
  mean: 0.82383, std: 0.03247, params: {'max_depth': 11, 'n_estimators': 40, 'min_samples_split': 12, 'min_samples_leaf': 1},
  mean: 0.81879, std: 0.02934, params: {'max_depth': 11, 'n_estimators': 50, 'min_samples_split': 12, 'min_samples_leaf': 1},
  mean: 0.81879, std: 0.02623, params: {'max_depth': 11, 'n_estimators': 100, 'min_samples_split': 12, 'min_samples_leaf': 1},
  mean: 0.82047, std: 0.03782, params: {'max_depth': 11, 'n_estimators': 30, 'min_samples_split': 3, 'min_samples_leaf': 3},
  mean: 0.82215, std: 0.03849, params: {'max_depth': 11, 'n_estimators': 40, 'min_samples_split': 3, 'min_samples_leaf': 3},
  mean: 0.81544, std: 0.03191, params: {'max_depth': 11, 'n_estimators': 50, 'min_samples_split': 3, 'min_samples_leaf': 3},
  mean: 0.81208, std: 0.04449, params: {'max_depth': 11, 'n_estimators': 100, 'min_samples_split': 3, 'min_samples_leaf': 3},
  mean: 0.81711, std: 0.03149, params: {'max_depth': 11, 'n_estimators': 30, 'min_samples_split': 6, 'min_samples_leaf': 3},
  mean: 0.82718, std: 0.04505, params: {'max_depth': 11, 'n_estimators': 40, 'min_samples_split': 6, 'min_samples_leaf': 3},
  mean: 0.82047, std: 0.04752, params: {'max_depth': 11, 'n_estimators': 50, 'min_samples_split': 6, 'min_samples_leaf': 3},
  mean: 0.81208, std: 0.02881, params: {'max_depth': 11, 'n_estimators': 100, 'min_samples_split': 6, 'min_samples_leaf': 3},
  mean: 0.82886, std: 0.03168, params: {'max_depth': 11, 'n_estimators': 30, 'min_samples_split': 9, 'min_samples_leaf': 3},
  mean: 0.81208, std: 0.02690, params: {'max_depth': 11, 'n_estimators': 40, 'min_samples_split': 9, 'min_samples_leaf': 3},
  mean: 0.81711, std: 0.03256, params: {'max_depth': 11, 'n_estimators': 50, 'min_samples_split': 9, 'min_samples_leaf': 3},
  mean: 0.81879, std: 0.03856, params: {'max_depth': 11, 'n_estimators': 100, 'min_samples_split': 9, 'min_samples_leaf': 3},
  mean: 0.80872, std: 0.04616, params: {'max_depth': 11, 'n_estimators': 30, 'min_samples_split': 12, 'min_samples_leaf': 3},
  mean: 0.81040, std: 0.04085, params: {'max_depth': 11, 'n_estimators': 40, 'min_samples_split': 12, 'min_samples_leaf': 3},
  mean: 0.81879, std: 0.03103, params: {'max_depth': 11, 'n_estimators': 50, 'min_samples_split': 12, 'min_samples_leaf': 3},
  mean: 0.81879, std: 0.03719, params: {'max_depth': 11, 'n_estimators': 100, 'min_samples_split': 12, 'min_samples_leaf': 3},
  mean: 0.81544, std: 0.05650, params: {'max_depth': 11, 'n_estimators': 30, 'min_samples_split': 3, 'min_samples_leaf': 5},
  mean: 0.81376, std: 0.04937, params: {'max_depth': 11, 'n_estimators': 40, 'min_samples_split': 3, 'min_samples_leaf': 5},
  mean: 0.81208, std: 0.04099, params: {'max_depth': 11, 'n_estimators': 50, 'min_samples_split': 3, 'min_samples_leaf': 5},
  mean: 0.82047, std: 0.04274, params: {'max_depth': 11, 'n_estimators': 100, 'min_samples_split': 3, 'min_samples_leaf': 5},
  mean: 0.81040, std: 0.04311, params: {'max_depth': 11, 'n_estimators': 30, 'min_samples_split': 6, 'min_samples_leaf': 5},
  mean: 0.81208, std: 0.04002, params: {'max_depth': 11, 'n_estimators': 40, 'min_samples_split': 6, 'min_samples_leaf': 5},
  mean: 0.81711, std: 0.03810, params: {'max_depth': 11, 'n_estimators': 50, 'min_samples_split': 6, 'min_samples_leaf': 5},
  mean: 0.80872, std: 0.03959, params: {'max_depth': 11, 'n_estimators': 100, 'min_samples_split': 6, 'min_samples_leaf': 5},
  mean: 0.82718, std: 0.04519, params: {'max_depth': 11, 'n_estimators': 30, 'min_samples_split': 9, 'min_samples_leaf': 5},
  mean: 0.81544, std: 0.03456, params: {'max_depth': 11, 'n_estimators': 40, 'min_samples_split': 9, 'min_samples_leaf': 5},
  mean: 0.82383, std: 0.03927, params: {'max_depth': 11, 'n_estimators': 50, 'min_samples_split': 9, 'min_samples_leaf': 5},
  mean: 0.82047, std: 0.04765, params: {'max_depth': 11, 'n_estimators': 100, 'min_samples_split': 9, 'min_samples_leaf': 5},
  mean: 0.81040, std: 0.03286, params: {'max_depth': 11, 'n_estimators': 30, 'min_samples_split': 12, 'min_samples_leaf': 5},
  mean: 0.80872, std: 0.03188, params: {'max_depth': 11, 'n_estimators': 40, 'min_samples_split': 12, 'min_samples_leaf': 5},
  mean: 0.81208, std: 0.03811, params: {'max_depth': 11, 'n_estimators': 50, 'min_samples_split': 12, 'min_samples_leaf': 5},
  mean: 0.81711, std: 0.04045, params: {'max_depth': 11, 'n_estimators': 100, 'min_samples_split': 12, 'min_samples_leaf': 5}],
 {'max_depth': 7,
  'min_samples_leaf': 1,
  'min_samples_split': 12,
  'n_estimators': 40},
 0.8322147651006712)
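The cross-validation procedure described in the grid search comments can be written out by hand for a single parameter setting. This is a sketch on synthetic data with three folds; the dataset and the n_estimators value are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, n_features=7, random_state=0)

fold_scores = []
splitter = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, val_idx in splitter.split(X):
    # Build on two parts, validate on the remaining part.
    model = RandomForestClassifier(n_estimators=40, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

mean_score = float(np.mean(fold_scores))
print(fold_scores, mean_score)
```

GridSearchCV repeats this loop for every parameter combination and reports the mean score of each. For classifiers it also stratifies the folds by class by default, which this plain KFold sketch does not do.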

Retrain with parameter values adjusted according to the result

from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
# Note: min_samples_split is set to 15 here, while the grid search above selected 12
alg = RandomForestClassifier(random_state=1,n_estimators=40,min_samples_split=15,min_samples_leaf=1,max_depth=7)
# Build the model
alg.fit(x_train,y_train.values.ravel())

y_pred = alg.predict(x_test)
cnf_matrix =confusion_matrix(y_test,y_pred)
# Print floating-point numbers with 2 decimal places
np.set_printoptions(precision=2)
print(accuracy_score(y_test,y_pred))

cnf_matrix

The output is as follows:

0.820338983051
Out[33]:
array([[165,  19],
       [ 34,  77]], dtype=int64)
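The accuracy printed above can be recovered from the confusion matrix itself, along with precision and recall for the "survived" class. This sketch copies the matrix from the output; in scikit-learn's layout, rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cnf = np.array([[165, 19],
                [34, 77]])
tn, fp, fn, tp = cnf.ravel()

accuracy = (tp + tn) / cnf.sum()   # (77 + 165) / 295
precision = tp / (tp + fp)         # 77 / 96: predicted survivors who did survive
recall = tp / (tp + fn)            # 77 / 111: actual survivors who were found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
# 0.8203 0.8021 0.6937
```

The recall figure shows that the model misses roughly three in ten actual survivors, which the overall accuracy alone does not reveal.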

Estimating feature importance (here with a univariate F-test via SelectKBest):

from sklearn.feature_selection import SelectKBest, f_classif
import matplotlib.pyplot as plt
# Name and Ticket are raw strings and cannot be scored by f_classif, so only the numeric predictors are used
predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]
selector = SelectKBest(f_classif, k=5)
a = selector.fit(titanic[predictors], titanic["Survived"])
# The scores can be plotted directly
print(a.scores_)
scores = -np.log10(selector.pvalues_)
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors)
plt.show()
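A trained random forest also exposes its own importance estimate through the feature_importances_ attribute, which ties back to the earlier remark that a forest can report which features matter. The sketch below uses synthetic data and placeholder feature names (f0 to f6), not the Titanic columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=7, n_informative=3, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # placeholder names

forest = RandomForestClassifier(n_estimators=40, random_state=0).fit(X, y)

# feature_importances_ is impurity-based and normalized to sum to 1.
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name}: {value:.3f}")
```

This impurity-based measure is computed from the training data and is a different quantity from the F-test scores above, so the two rankings need not agree.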

