GBDT Regression in Practice: A Complete Summary (Part 1)

Part 1: Parameter Reference
 

 
(1) Overview
 

The ensemble module that ships with sklearn provides the GradientBoostingRegressor class. Its parameters are:
 

 
class sklearn.ensemble.GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto')
 

 
(2) Parameter meanings and allowed values
 

 
● marks an allowed value; ★ marks the default value
 

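1. loss : {'ls', 'lad', 'huber', 'quantile'}, optional (default='ls') ---------- loss function

● 'lad' (least absolute deviation) is a highly robust loss function based solely on the order information of the input variables. 'huber' is a combination of 'ls' and 'lad'. 'quantile' allows quantile regression (use alpha to specify the quantile).

★ 'ls' refers to least squares regression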
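To make the four options concrete, here is a minimal sketch of the per-sample losses in plain Python. The function names and the `delta` argument are illustrative assumptions, not sklearn's API; in particular, sklearn derives the Huber threshold from `alpha`, whereas this sketch simply takes it as an argument.

```python
def squared_loss(y, pred):
    """'ls': least squares."""
    return 0.5 * (y - pred) ** 2


def absolute_loss(y, pred):
    """'lad': least absolute deviation."""
    return abs(y - pred)


def huber_loss(y, pred, delta=1.0):
    """'huber': quadratic for small residuals, linear for large ones.

    delta is a hypothetical threshold argument used only in this sketch.
    """
    r = abs(y - pred)
    if r <= delta:
        return 0.5 * r ** 2
    return delta * (r - 0.5 * delta)


def quantile_loss(y, pred, alpha=0.9):
    """'quantile': pinball loss for the alpha-quantile."""
    r = y - pred
    return alpha * r if r >= 0 else (alpha - 1.0) * r


# A single outlier with residual 10 dominates the squared loss
# far more than it dominates the other three.
for name, loss in [("ls", squared_loss), ("lad", absolute_loss),
                   ("huber", huber_loss), ("quantile", quantile_loss)]:
    print(name, loss(10.0, 0.0))
```

The squared loss grows quadratically with the residual, which is why 'lad' and 'huber' are usually preferred when the targets contain outliers.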
 

 
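2. learning_rate : float, optional (default=0.1) ------- learning rate (shrinkage)

Note: this is the weight-shrinkage coefficient ν applied to each weak learner, also called the step size. ν takes values in 0 < ν ≤ 1.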

 
3. n_estimators : int (default=100) -------- number of sub-models (boosting stages)

● int: the number of sub-models

★ 100: the default
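learning_rate and n_estimators trade off against each other: a smaller step size needs more stages to reach the same fit. The toy sketch below illustrates this under a strong simplifying assumption, namely that every weak learner predicts the current residual exactly. The function name `remaining_residual` is hypothetical, and real trees fit the residual only approximately.

```python
def remaining_residual(n_estimators, learning_rate, initial_residual=1.0):
    """Residual left after boosting with an idealised weak learner."""
    residual = initial_residual
    for _ in range(n_estimators):
        weak_learner_output = residual  # idealised: fits the residual exactly
        residual -= learning_rate * weak_learner_output  # shrunken update
    return residual


# Each stage multiplies the residual by (1 - learning_rate).
for lr, n in [(1.0, 1), (0.1, 100), (0.01, 100), (0.01, 1000)]:
    print(f"learning_rate={lr}, n_estimators={n}: "
          f"residual={remaining_residual(n, lr):.6f}")
```

In this idealised setting, 100 stages at ν = 0.01 leave far more residual than 100 stages at ν = 0.1, which is why the two parameters are usually tuned together.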
 

 
4. max_depth : integer, optional (default=3) --------- maximum depth of each tree; ignored if max_leaf_nodes is specified

● int: the depth

★ 3: the default
 

 
5. criterion : string, optional (default="friedman_mse") -------- how the quality of a split is measured

Note: Supported criteria are "friedman_mse" for the mean squared error with improvement score by Friedman, "mse" for mean squared error, and "mae" for the mean absolute error.
 

 
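6. min_samples_split : int, float, optional (default=2) ------- minimum number of samples required to split a node

● int: number of samples

★ 2: the default

Note: if int, min_samples_split is used directly as the minimum number.

If float, min_samples_split is a fraction, and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.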
 

 
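7. min_samples_leaf : int, float, optional (default=1) ------- minimum number of samples at a leaf node

● int: number of samples

★ 1: the default

Note: if int, min_samples_leaf is used directly as the minimum number.

If float, min_samples_leaf is a fraction, and ceil(min_samples_leaf * n_samples) is the minimum number of samples at each leaf node.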
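The int/float rule shared by min_samples_split and min_samples_leaf can be sketched in a few lines. The helper name `resolve_min_samples` is hypothetical; it only mirrors the ceil formula quoted above and is not sklearn's internal code.

```python
import math


def resolve_min_samples(value, n_samples):
    """Turn an int or float setting into an absolute sample count."""
    if isinstance(value, int):
        return value  # int: used directly as the minimum
    return math.ceil(value * n_samples)  # float: fraction of n_samples, rounded up


print(resolve_min_samples(2, 1000))      # int is used as-is
print(resolve_min_samples(0.25, 1000))   # a quarter of 1000 samples
print(resolve_min_samples(0.001, 150))   # small fractions are rounded up
```

Using a float keeps the constraint proportional to the dataset size, which can be convenient when the same configuration is reused on datasets of different sizes.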
 

 
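8. min_weight_fraction_leaf : float, optional (default=0.) ------ minimum weighted fraction of samples at a leaf node

● float: the weighted fraction

★ 0: the default

Note: the minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.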
 

 
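9. subsample : float, optional (default=1.0) ------ subsampling rate

● float: the sampling rate

★ 1.0: the default

Note: a value smaller than 1.0 results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias. Values in [0.5, 0.8] are recommended; the default of 1.0 means no subsampling.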
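As a minimal sketch of stochastic gradient boosting, the snippet below fits a model with subsample=0.7 on synthetic data. The data and variable names are made up for illustration. With subsample < 1.0 the fitted model also exposes oob_improvement_, one value per boosting stage.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data, invented for this example.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Each tree is fitted on a random 70% of the training samples.
stochastic = GradientBoostingRegressor(
    n_estimators=50, subsample=0.7, random_state=0
).fit(X, y)

# Out-of-bag improvement in loss, one entry per stage.
print(stochastic.oob_improvement_.shape)
print(stochastic.oob_improvement_[:5])
```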

 
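10. max_features : int, float, string or None, optional (default=None) --- maximum number of features considered when splitting a node

● int: the number of features

● float: a fraction of all features

● auto: all features (for the regressor, max_features=n_features)

● sqrt: the square root of the number of features

● log2: log2 of the number of features

★ None: equal to the number of features

Note: choosing max_features < n_features leads to a reduction of variance and an increase in bias.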
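The options above can be read as a small lookup that turns the setting into a feature count. The helper `resolve_max_features` below is a hypothetical sketch of that reading. It truncates fractional results and never goes below one feature, which is my understanding of sklearn's behaviour rather than a copy of its code, and it leaves out 'auto'.

```python
import math


def resolve_max_features(max_features, n_features):
    """Number of features examined at each split (illustrative sketch)."""
    if max_features is None:
        return n_features  # None: use all features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features  # int: an absolute count
    return max(1, int(max_features * n_features))  # float: a fraction


for setting in [None, "sqrt", "log2", 5, 0.5]:
    print(setting, resolve_max_features(setting, 16))
```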
 

 
11. max_leaf_nodes : int or None, optional (default=None) ------ maximum number of leaf nodes

● int: the number of leaf nodes

★ None: no limit on the number of leaf nodes
 

 
12. min_impurity_split : float

Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.

Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19 and will be removed in 0.21. Use min_impurity_decrease instead.
 

 
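13. min_impurity_decrease : float, optional (default=0.) ---------- threshold on the impurity decrease required for a split

Note: a node will be split if the split induces a decrease of the impurity greater than or equal to this value.

The weighted impurity decrease equation is:

N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.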
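The equation translates directly into code. The function and argument names below are illustrative, and the sample counts and impurities are made-up numbers chosen only to show the arithmetic.

```python
def weighted_impurity_decrease(n_total, n_node, n_left, n_right,
                               impurity, left_impurity, right_impurity):
    """N_t / N * (impurity - N_t_R / N_t * right - N_t_L / N_t * left)."""
    return n_node / n_total * (
        impurity
        - n_right / n_node * right_impurity
        - n_left / n_node * left_impurity
    )


# Hypothetical node: 40 of 100 samples, split into 10 (left) and 30 (right).
decrease = weighted_impurity_decrease(
    n_total=100, n_node=40, n_left=10, n_right=30,
    impurity=2.0, left_impurity=1.0, right_impurity=0.5,
)

min_impurity_decrease = 0.1  # example threshold
print(decrease, decrease >= min_impurity_decrease)
```

Because of the N_t / N factor, the same raw impurity reduction counts for less at a node that holds only a small share of the samples.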
 

14. alpha : float (default=0.9) --------- the parameter of the loss function when loss is 'huber' or 'quantile'

The alpha-quantile of the huber loss function and the quantile loss function. Only if loss='huber' or loss='quantile'.
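A common use of alpha is building a prediction interval from two quantile models. The sketch below uses synthetic data invented for this example and checks what share of the training targets falls below each model's prediction. The exact shares depend on the data and the sklearn version, so no specific numbers are claimed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: a linear trend plus Gaussian noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 1))
y = X[:, 0] + rng.normal(scale=1.0, size=500)

# One model per quantile: alpha selects which quantile is estimated.
upper = GradientBoostingRegressor(loss="quantile", alpha=0.9, random_state=0).fit(X, y)
lower = GradientBoostingRegressor(loss="quantile", alpha=0.1, random_state=0).fit(X, y)

# Share of training targets at or below each predicted quantile.
coverage_upper = np.mean(y <= upper.predict(X))
coverage_lower = np.mean(y <= lower.predict(X))
print(coverage_upper, coverage_lower)
```

Together, the two models give an approximate 80% interval, with the alpha=0.1 prediction as the lower bound and the alpha=0.9 prediction as the upper bound.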
 

 
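15. init : BaseEstimator, None, optional (default=None) -------- initial sub-model

Note: an estimator object that is used to compute the initial predictions. init has to provide fit and predict. If None, it uses loss.init_estimator.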
 

 
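16. verbose : int, default: 0 ------- verbosity of the log output

● int: the verbosity level

★ 0: no output during training

● 1: prints progress once in a while

● >1: prints progress for every sub-model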
 

 
17. warm_start : bool, default: False -------- whether to warm-start; if True, the next call to fit adds more trees to the existing ensemble

● bool: enable warm start

★ False: the default
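A minimal sketch of warm starting: fit 50 trees, raise n_estimators, and fit again so that only the additional trees are trained. The synthetic data and variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, invented for this example.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = X[:, 0] ** 2 + X[:, 1]

model = GradientBoostingRegressor(n_estimators=50, warm_start=True, random_state=0)
model.fit(X, y)
n_after_first_fit = model.estimators_.shape[0]

# Raise the target number of trees and fit again:
# the existing trees are kept and only the missing ones are added.
model.set_params(n_estimators=80)
model.fit(X, y)

print(n_after_first_fit, model.estimators_.shape)
```

This is handy for growing an ensemble step by step while monitoring a validation score, without retraining from scratch each time.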
 

 
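18. random_state : int, RandomState instance or None, optional (default=None) --- random number generator

Note: if int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.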
 

 
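19. presort : bool or 'auto', optional (default='auto') --------- whether to presort the data; presorting speeds up the search for the best split but does not work on sparse data

● bool

★ auto: presort dense data, do not presort sparse data

Note: by default, auto mode uses presorting on dense data and falls back to normal sorting on sparse data. Setting presort to True on sparse data raises an error.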
 

 
(3) Attributes
 

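1. feature_importances_ : array, shape = [n_features] ------------ the feature importances; the higher, the more important

2. oob_improvement_ : array, shape = [n_estimators] --------- the improvement in loss on the out-of-bag samples relative to the previous iteration (only available if subsample < 1.0)

3. train_score_ : array, shape = [n_estimators] ------- the loss (deviance) of the model on the in-bag training samples at each iteration

4. loss_ : LossFunction --------- the concrete loss function object

5. init : BaseEstimator ------ the weak learner used for initialization, corresponding to f0(x) in the theory post. If it is not supplied, the initial prediction is computed from the training samples; otherwise the learner supplied through the init parameter makes the initial prediction. It is typically used when we have prior knowledge about the data or have already done some fitting; otherwise this parameter can be left alone.

6. estimators_ : ndarray of DecisionTreeRegressor, shape = [n_estimators, loss_.K] --- the collection of fitted sub-estimators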
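The attributes above can be inspected after fitting. The sketch below uses synthetic data in which the first feature dominates the target and the third is unused. The data and names are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: feature 0 is strong, feature 1 is weak, feature 2 is unused.
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(300, 3))
y = 5.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.05, size=300)

model = GradientBoostingRegressor(n_estimators=60, random_state=0).fit(X, y)

print(model.feature_importances_)   # one importance per feature
print(model.train_score_.shape)     # one training loss per boosting stage
print(model.estimators_.shape)      # (n_estimators, 1) for regression
print(model.train_score_[0], model.train_score_[-1])  # loss at first and last stage
```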
 

 
(4) Methods
 

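1. apply(X) ------- applies the trees in the ensemble to X and returns the leaf indices.

Note: Parameters: X : array-like or sparse matrix, shape = [n_samples, n_features]

Returns: X_leaves : array_like, shape = [n_samples, n_estimators]

2. fit(X, y, sample_weight=None, monitor=None) ---- fits the gradient boosting model.

3. get_params(deep=True) ------ returns the parameters of the estimator as a dict

4. predict(X) ------ predicts the regression target for X

5. score(X, y, sample_weight=None) ----- returns the coefficient of determination R^2 of the prediction

6. set_params(**params) --------- sets the parameters of the estimator; they are passed as keyword arguments, so a dict can be unpacked with **

7. staged_predict(X)

Note: Predict regression target at each stage for X.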
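As a minimal sketch of how these methods fit together, the snippet below uses staged_predict to find the number of stages with the lowest validation error, then calls apply, score and set_params. The synthetic data and the variable names (`val_errors`, `best_n`) are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data, invented for this example.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=400)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# staged_predict yields the prediction after 1, 2, ..., n_estimators stages.
val_errors = [mean_squared_error(y_val, pred)
              for pred in model.staged_predict(X_val)]
best_n = int(np.argmin(val_errors)) + 1

leaves = model.apply(X_val)      # leaf index of every sample in every tree
r2 = model.score(X_val, y_val)   # R^2 on the validation set
print(best_n, leaves.shape, r2)

# set_params takes keyword arguments; a dict is unpacked with **.
model.set_params(**{"n_estimators": best_n})
print(model.get_params()["n_estimators"])
```

Calling set_params only updates the configuration; the model has to be fitted again for the new n_estimators to take effect.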

The hands-on part is covered in the next post.
