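Calling the Various Classifiers

Preface: this post mainly introduces how to use the various classifiers in sklearn.

0. Details of the parameters that classifiers support

Kernel functions:
1. Linear kernel: mainly for linearly separable data. It has few parameters and trains fast, and for ordinary data its classification quality is often already good enough.
2. RBF kernel: mainly for data that is not linearly separable. It has more parameters, and the result depends heavily on how they are set.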
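To make the contrast between the two kernels concrete, here is a minimal sketch that fits both on a toy data set that is not linearly separable. The data set (make_moons), the split, and the values C=1.0 and gamma="scale" are assumptions chosen for illustration, not settings from the original notes.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: a classic toy set that a straight line cannot separate.
X, y = make_moons(n_samples=400, noise=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for kernel in ("linear", "rbf"):
    # gamma is ignored by the linear kernel; it only matters for 'rbf' here.
    clf = SVC(kernel=kernel, C=1.0, gamma="scale")
    clf.fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)

print(scores)  # held-out accuracy per kernel
```

On data like this the RBF kernel usually has the edge, but only after C and gamma are tuned, which is the parameter dependence mentioned above.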

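I. K-nearest neighbors

from sklearn.neighbors import NearestNeighbors

Its results can differ slightly from a hand-written implementation. Note that sklearn's neighbor search ('brute', 'kd_tree', 'ball_tree') is exact rather than approximate, so such differences usually come from how ties between equidistant points are broken, or from a different distance metric or weighting.
1. A lesson from a shop-location project: the kneighbors function and the predict_proba probability function (the latter belongs to KNeighborsClassifier, not to NearestNeighbors) are closely related but not fully interchangeable, so it is safer to treat them as independent.
In that project, two different probability matrices were therefore built, one from each.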
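The relationship between the two functions can be sketched as follows. With the default uniform weights, predict_proba is simply the fraction of the k nearest neighbors that belong to each class, so it can be rebuilt from the indices that kneighbors returns. The tiny one-dimensional data set is made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1-D data: three points of class 0 and three of class 1.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # weights="uniform" by default
knn.fit(X, y)

query = np.array([[1.5], [9.0]])
dist, idx = knn.kneighbors(query)   # distances and training-set indices
proba = knn.predict_proba(query)    # per-class vote fractions

# Rebuild the probabilities by hand from the neighbor indices.
manual = np.stack([(y[idx] == c).mean(axis=1) for c in knn.classes_], axis=1)
print(proba)
print(manual)
```

With weights="distance" the votes are weighted by inverse distance, so a plain count over the kneighbors indices no longer reproduces predict_proba. That may be one reason the two appear only loosely coupled in practice.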

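II. Support vector machines

Input data types: dense data should be a numpy.ndarray (or anything numpy.asarray can convert); sparse data should be a scipy.sparse matrix. For speed, use a C-ordered numpy.ndarray (dense) or a scipy.sparse.csr_matrix (sparse) with dtype=float64.

SVC vs. LinearSVC: LinearSVC can sometimes work much better than SVC. LinearSVC is built on liblinear, supports soft margins, and is fast, but it has no kernels and cannot predict probabilities. SVC is built on libsvm (an SMO-type solver) and supports kernel functions and probability prediction, but it is slow.

LinearSVC is faster than SVC because the two call different underlying libraries. The documentation also notes that SGDClassifier can be trained with the same hinge-type loss as LinearSVC, so it can serve as a substitute.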

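Advantages of SVMs (reference):
(1) Effective in high-dimensional spaces, including when the number of dimensions exceeds the number of samples.
(2) Memory efficient: the decision function uses only a subset of the training points (the support vectors).
(3) Versatile: different kernel functions can be specified, and the kernel shapes the decision function.
Disadvantages of SVMs:
(1) When the number of features is much larger than the number of samples, the model overfits easily, so the choice of kernel and regularization term becomes crucial.
(2) They do not provide probability estimates directly; these are computed with an expensive five-fold cross-validation.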
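The second disadvantage shows up directly in the API: probabilities are only available when probability=True is passed before fitting, which triggers the extra internal cross-validation. A minimal sketch on the iris data set (chosen here only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# probability=True must be set before fit; it makes fit noticeably slower
# because an additional internal cross-validation calibrates the scores.
clf = SVC(kernel="rbf", gamma="scale", probability=True, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X[:5])  # one row per sample, one column per class
print(proba.round(3))
```

Because the probabilities come from a separate calibration step, the class with the highest probability can occasionally disagree with predict on borderline samples.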

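from sklearn.svm import SVC

svc = SVC(decision_function_shape='ovo', class_weight='balanced')
svc.fit(all_train_vecs_nor, y_train_data)  # normalized training vectors and their labels
predict = svc.predict(unlabeled_vecs_nor)  # unlabeled_vecs_nor: the normalized test-set vectors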

class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape=None, random_state=None)

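1. Explanation and translation of each parameter.

C : float, optional (default=1.0)
Penalty coefficient of the error term.
kernel : string, optional (default='rbf')
Kernel function: one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'. With 'precomputed', the matrix passed to fit is treated as the kernel matrix itself; if a callable is given, it is used to compute the kernel matrix.
degree : int, optional (default=3)
Degree of the polynomial kernel ('poly'); ignored by the other kernels.
gamma : float, optional (default='auto')
Kernel coefficient for 'rbf', 'poly' and 'sigmoid'. With 'auto', 1/n_features is used.
coef0 : float, optional (default=0.0)
Independent (offset) term of the 'poly' and 'sigmoid' kernels.
probability : boolean, optional (default=False)
Whether to enable probability estimates. It must be set before calling fit and slows fit down. It determines whether class probabilities are available in addition to the predicted class.
shrinking : boolean, optional (default=True)
Whether to use the shrinking heuristic. (If the solver can tell in advance which variables are already settled, it keeps them fixed and optimizes only the rest, which can greatly reduce training time.)
tol : float, optional (default=1e-3)
Tolerance of the stopping criterion.
cache_size : float, optional
Size of the kernel cache (in MB). A larger cache can noticeably shorten training on bigger problems.
class_weight : {dict, 'balanced'}, optional
The default is None, meaning every class has weight 1. With 'balanced', the more samples a class has, the smaller its weight, so that the classes end up balanced.
verbose : bool, default: False
Whether to print progress for each round. Note that it may not work properly in a multithreaded context.
max_iter : int, optional (default=-1)
Maximum number of solver iterations; -1 means no limit.
decision_function_shape : 'ovo', 'ovr' or None, default=None
Shape of the decision function for multi-class problems. 'ovo' (one-vs-one) lets every pair of classes vote, giving n_classes * (n_classes - 1) / 2 classifiers; 'ovr' (one-vs-rest) compares each class with all the others, giving n_classes classifiers.
random_state : int seed, RandomState instance, or None (default)
Seed of the random number generator used to shuffle the data for probability estimation.
epsilon : float, optional (default=0.1)
Only used in regression classes such as SVR: errors smaller than epsilon are treated as no error at all.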
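The classifier counts given for decision_function_shape can be checked on a small four-class problem. This is a sketch: the synthetic make_blobs data and the linear kernel are arbitrary choices. With four classes, 'ovo' should yield 4 * 3 / 2 = 6 columns and 'ovr' 4 columns.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic data with four classes, so that the two shapes actually differ.
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

shapes = {}
for shape in ("ovo", "ovr"):
    clf = SVC(kernel="linear", decision_function_shape=shape).fit(X, y)
    shapes[shape] = clf.decision_function(X[:3]).shape

print(shapes)  # {'ovo': (3, 6), 'ovr': (3, 4)}
```

Note that SVC always trains one-vs-one internally; 'ovr' only changes the shape of the returned decision function.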

2. Attributes (read directly from the fitted model):

support_ : array-like, shape = [n_SV]
Indices of support vectors.
support_vectors_ : array-like, shape = [n_SV, n_features]
Support vectors.
n_support_ : array-like, dtype=int32, shape = [n_class]
Number of support vectors for each class.
dual_coef_ : array, shape = [n_class-1, n_SV]
Coefficients of the support vector in the decision function. For multiclass, coefficient for all 1-vs-1 classifiers. The layout of the coefficients in the multiclass case is somewhat non-trivial. See the section about multi-class classification in the SVM section of the User Guide for details.
coef_ : array, shape = [n_class-1, n_features]
Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel.
coef_ is a readonly property derived from dual_coef_ and support_vectors_.
intercept_ : array, shape = [n_class * (n_class-1) / 2]
Constants in decision function.

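III. Stochastic gradient descent classifier

(reference)
1. Pros and cons
Pros: commonly used for text classification and natural language processing. It handles large sample counts and many features, is efficient, and is easy to tune.
Cons: it requires several hyperparameters (for example the regularization parameter and the number of iterations), and it is sensitive to feature scaling.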
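Because of the sensitivity to feature scaling, a common pattern is to put a scaler in front of the classifier. The sketch below assumes a reasonably recent scikit-learn (one where SGDClassifier takes max_iter and tol); the breast-cancer data set and the hyperparameter values are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# StandardScaler learns mean/variance on the training split only and
# applies the same transform at prediction time.
model = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4,
                  max_iter=1000, tol=1e-3, random_state=0),
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training.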

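2. Stochastic gradient descent for classification (linear models)
Notes:
(1) Shuffle the training data before fitting, or keep shuffle=True so that it is reshuffled after every iteration.
(2) loss="log" and loss="modified_huber" are better suited to multi-class (one-vs-all) problems; they also make predict_proba available.
(3) Setting average=True turns it into averaged SGD (ASGD): coef_ becomes the average of the plain-SGD coefficients over all updates. A larger learning rate can then be used, which shortens training.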

from sklearn.linear_model import SGDClassifier
X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = SGDClassifier(loss="hinge", penalty="l2")  # loss: how the loss is computed; penalty: the regularization term
clf.fit(X, y)
print(clf.predict([[2., 2.]]))  # prints array([1])
print(clf.coef_)  # coefficients (feature weights)
print(clf.intercept_)  # intercept, i.e. the bias/offset term
print(clf.decision_function([[2., 2.]]))  # signed distance of each point to the hyperplane

3. Parameter tuning for classification

class sklearn.linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, n_iter=5, shuffle=True, verbose=0, epsilon=0.1, n_jobs=1, random_state=None, learning_rate='optimal', eta0=0.0, power_t=0.5, class_weight=None, warm_start=False, average=False)

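loss : str # The loss. 'hinge' [linear SVM], 'log' [logistic regression, a probabilistic classifier], 'modified_huber' [tolerant to outliers, a probabilistic classifier], 'squared_hinge' [hinge with a quadratic penalty], 'perceptron' [the linear perceptron algorithm]; regression losses [see the SGDRegressor section for the algorithms]: 'squared_loss', 'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'
penalty : str, 'none', 'l2', 'l1', or 'elasticnet' # The penalty, also called the regularization term. 'l2' is the standard regularizer for linear SVM models; 'l1' and 'elasticnet' are used for feature selection because they make the model sparse, which 'l2' cannot do.
alpha : float # Constant that multiplies the regularization term. If learning_rate is 'optimal', it is also used as a variable in that schedule's formula.
l1_ratio : float # Mixing ratio of the elastic-net penalty. 0 is equivalent to using only l2, 1 to using only l1.
fit_intercept : bool # Whether to fit an intercept. Default True. If the data is already centered, no intercept is needed.
n_iter : int, optional # Number of passes over the training set, i.e. the number of iterations. With partial_fit it is set to 1; otherwise the default is 5.
shuffle : bool, optional # Whether to reshuffle after every iteration. For classification, keep the default True.
random_state : int seed, RandomState instance, or None (default) # Random seed; it is best to set a concrete value yourself.
verbose : integer, optional # Verbosity level.
epsilon : float # Only acts as a threshold when loss is one of 'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'.
n_jobs : integer, optional # -1 means use all CPU cores.
learning_rate : string, optional # Learning-rate schedule, with 3 choices: 'constant', 'optimal', 'invscaling'
eta0 : double # Only has an effect when learning_rate is 'constant' or 'invscaling', because the formulas of those two schedules use it.
power_t : double # Only has an effect when learning_rate is 'invscaling', because the formula of that schedule uses it.
class_weight : dict, {class_label: weight} or "balanced" or None, optional # Weight of each class in classification. If not given, all weights are 1. With "balanced", the more often a class appears in the training set, the lower its weight, so the predictions tend to even out.
warm_start : bool, optional # Warm start. If fit has been called before and this is True, the previous solution is used as the initial value for the next fit.
average : bool or int, optional # Averaging. If True, the averaged SGD weights are computed and stored in the coef_ attribute. If set to an integer greater than 1, averaging starts once that many training samples have been seen.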
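The "balanced" option of class_weight follows a simple formula, n_samples / (n_classes * count_of_class). The sketch below computes it by hand and compares it with scikit-learn's helper; the 90/10 label split is made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# "balanced" weight of a class = n_samples / (n_classes * samples_in_class)
manual = len(y) / (len(classes) * np.bincount(y))
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print(manual)   # the rare class gets the larger weight
print(weights)
```

The rare class ends up with a weight nine times that of the frequent one, which is what pushes the fitted model toward balanced predictions.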

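4. Two attributes:

coef_ : array, shape (1, n_features) if n_classes == 2 else (n_classes, n_features) # Weights (coefficients) of the features.

intercept_ : array, shape (1,) if n_classes == 2 else (n_classes,) # Constant in the decision function, i.e. the intercept or offset.

5. Methods (reference)

decision_function(X) # Confidence score of each sample, equal to its signed distance to the hyperplane.
densify() # Convert the coefficient matrix to a dense array.
fit(X, y[, coef_init, intercept_init, sample_weight]) # Train the linear SGD model. coef_init: initial coefficients for a warm start; intercept_init: initial intercept for a warm start; sample_weight: weight of each sample.
fit_transform(X[, y]) # Fit to the data, then transform it.
get_params([deep]) # Get the parameters of the estimator.
partial_fit(X, y[, classes, sample_weight]) # Train the linear SGD model incrementally on one batch of samples. classes: all class labels that can occur; required on the first call.
predict(X) # Predict y for unlabeled X.
predict_log_proba, predict_proba # Essentially the same method; only valid with loss='log' or 'modified_huber'. The difference is whether a log is applied to the probabilities.
score(X, y[, sample_weight]) # Return the mean accuracy.
set_params(*args, **kwargs) # Set parameters.
sparsify() # Convert the coefficient matrix to a sparse format.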
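The role of the classes argument of partial_fit is easiest to see in a mini-batch loop. This is a sketch on synthetic data: the batch size, the label rule, and the hold-out split are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.randn(600, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic, linearly separable labels

clf = SGDClassifier(loss="hinge", random_state=0)
all_classes = np.array([0, 1])  # every label that can ever occur

# Feed the first 500 samples in batches of 100, as if they arrived as a stream.
for start in range(0, 500, 100):
    batch = slice(start, start + 100)
    # classes is required on the first call; repeating it later is harmless.
    clf.partial_fit(X[batch], y[batch], classes=all_classes)

holdout_accuracy = clf.score(X[500:], y[500:])
print(holdout_accuracy)
```

Each partial_fit call performs a single pass over its batch, so the number of epochs is controlled by how often the caller loops over the data.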

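IV. Random forest classifier

The parameters that matter most are n_estimators (the number of trees) and max_features (the maximum number of features). Remember to fix random_state. [A random forest is hard to overfit, so if compute allows you can simply use many trees; it still has a performance ceiling, though, so it sometimes cannot beat a CNN.]

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=80,n_jobs=-1,random_state=123)  # on a shared machine, n_jobs=-1 (all cores) may be too greedy; about half of the CPU cores is enough (25 on my server)
rfc.fit(train_vecs_nor, y_train)
predict = rfc.predict(unlabeled_vecs_nor)

Random forest tuning (one blog recommends focusing on: max_features; n_estimators, the number of sub-trees whose results are aggregated by voting or averaging; min_samples_leaf, the minimum number of samples in a leaf; max_leaf_nodes, the maximum number of leaves, which relates to purity and is quite useful). (reference)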

class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)

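1. n_estimators: number of sub-trees. # The number of trees to build before taking the majority vote or the average of the predictions. More trees give better performance but make the code slower. Choose as high a value as your processor can handle, because it makes the predictions better and more stable.
2. criterion: split quality measure. # Two choices: gini (Gini impurity) and entropy (information gain). (reference)
3. max_features: maximum number of features considered at each split. # Increasing it generally improves performance, because each node has more options to consider. This is not always true, however, because it reduces the diversity of the individual trees, which is the distinctive strength of a random forest. What is certain is that increasing max_features slows the algorithm down.
4. min_samples_split: minimum number of samples required to split an internal node. # A node holding fewer samples than this is not split any further.
5. min_samples_leaf: minimum number of samples required in a leaf. # If you have built a decision tree before, you will appreciate how important the minimum leaf size is. A leaf is an end node of the tree. Smaller leaves make the model more prone to capturing noise in the training data. I generally prefer a minimum leaf size above 50. For your own data, try several leaf sizes to find the best one.
6. max_depth: maximum depth of a tree. # If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
7. min_weight_fraction_leaf: minimum weighted fraction in a leaf. # The minimum fraction of the total sample weight that a leaf must hold; a split that would leave a leaf below it is not made. Default 0.
8. max_leaf_nodes: maximum number of leaves. # Relates to purity and is quite useful.
9. min_impurity_split: impurity threshold. # A node is split only if its impurity is above this threshold; otherwise it becomes a leaf.
10. bootstrap: bootstrap sampling. # Default True: each tree is built on a bootstrap sample of the training set.
11. oob_score: out-of-bag score. # Default False, i.e. the out-of-bag samples are not used to estimate the generalization accuracy.
12. n_jobs: number of parallel jobs. # Default 1. -1 uses all cores of the machine for parallel computation, which is fine if the machine can spare them. My own R720 uses 25.
13. random_state: random state. # If you want different results on every run, keep the default None; to make results reproducible, put any fixed number here.
14. verbose: verbosity. # Controls how much the tree-building process prints. Default 0.
15. warm_start: warm start. # If True, the previous fit is reused and new estimators are added to the existing ensemble. Default False, i.e. a new random forest is built.
16. class_weight: class weights. # Weights for the classes in y, given as {class_label: weight}; by default every class has weight 1, and 'balanced' weights classes inversely to their frequency.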
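Several of the parameters above can be tried together in a short sketch. The breast-cancer data set and the particular values (200 trees, max_features="sqrt") are assumptions for illustration. With oob_score=True the forest reports an out-of-bag accuracy estimate without needing a separate validation split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rfc = RandomForestClassifier(
    n_estimators=200,      # more trees: slower, but more stable predictions
    max_features="sqrt",   # features considered at each split
    min_samples_leaf=1,    # raise this to smooth out noisy leaves
    oob_score=True,        # score each sample using only trees that never saw it
    random_state=123,      # fixed for reproducibility
)
rfc.fit(X, y)

print(rfc.oob_score_)                                # out-of-bag accuracy estimate
top5 = rfc.feature_importances_.argsort()[::-1][:5]  # indices of the 5 most important features
print(top5)
```

The out-of-bag score is only defined when bootstrap=True, since it relies on the samples each tree did not draw.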

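Methods of the random forest

1. apply(X) # For n samples, returns the index of the leaf each sample lands in, for every tree.
2. decision_path(X) # Returns the decision path of the samples through the forest.
3. fit(X, y[, sample_weight]) # sample_weight gives the weight of each sample and may be omitted. Returns the fitted forest.
4. fit_transform(X[, y]) # Fit to the data, then transform it; returns the new training set X.
5. get_params([deep]) # Returns the parameters of the estimator.
6. predict(X) # Returns the class with the highest mean predicted probability across the trees.
7. predict_log_proba(X) # The logarithm of predict_proba.
8. predict_proba(X) # Class probabilities, computed as the mean of the trees' predicted probabilities. These three differ only in the form of their output.
9. score(X, y[, sample_weight]) # Returns the mean accuracy.
10. set_params(**params) # Set parameters after construction.

Random forest example

Bookmark: just look at the project example. (reference)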

V. Neural networks

from sklearn.neural_network import MLPClassifier
X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,hidden_layer_sizes=(5, 2), random_state=1)
clf.fit(X, y)

Neural network parameter tuning

class sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(100, ), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)

hidden_layer_sizes : tuple. The i-th element is the number of neurons in the i-th hidden layer. The tuple's length equals n_layers - 2, because the input and output layers are not counted.
activation : {‘identity’, ‘logistic’, ‘tanh’, ‘relu’}, default ‘relu’
Activation function for the hidden layer.
‘identity’, no-op activation, useful to implement linear bottleneck, returns f(x) = x
‘logistic’, the logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x)).
‘tanh’, the hyperbolic tan function, returns f(x) = tanh(x).
‘relu’, the rectified linear unit function, returns f(x) = max(0, x)
solver : {‘lbfgs’, ‘sgd’, ‘adam’}, default ‘adam’
The solver for weight optimization.
‘lbfgs’ is an optimizer in the family of quasi-Newton methods.
‘sgd’ refers to stochastic gradient descent.
‘adam’ refers to a stochastic gradient-based optimizer proposed by Kingma, Diederik, and Jimmy Ba
Note: The default solver ‘adam’ works pretty well on relatively large datasets (with thousands of training samples or more) in terms of both training time and validation score. For small datasets, however, ‘lbfgs’ can converge faster and perform better.
alpha : float, optional, default 0.0001
L2 penalty (regularization term) parameter.
batch_size : int, optional, default ‘auto’
Size of minibatches for stochastic optimizers. If the solver is ‘lbfgs’, the classifier will not use minibatch. When set to “auto”, batch_size=min(200, n_samples)
learning_rate : {‘constant’, ‘invscaling’, ‘adaptive’}, default ‘constant’
Learning rate schedule for weight updates.
‘constant’ is a constant learning rate given by ‘learning_rate_init’.
‘invscaling’ gradually decreases the learning rate learning_rate_ at each time step ‘t’ using an inverse scaling exponent of ‘power_t’. effective_learning_rate = learning_rate_init / pow(t, power_t)
‘adaptive’ keeps the learning rate constant to ‘learning_rate_init’ as long as training loss keeps decreasing. Each time two consecutive epochs fail to decrease training loss by at least tol, or fail to increase validation score by at least tol if ‘early_stopping’ is on, the current learning rate is divided by 5.
Only used when solver=’sgd’.
max_iter : int, optional, default 200
Maximum number of iterations. The solver iterates until convergence (determined by ‘tol’) or this number of iterations.
random_state : int or RandomState, optional, default None
State or seed for random number generator.
shuffle : bool, optional, default True
Whether to shuffle samples in each iteration. Only used when solver=’sgd’ or ‘adam’.
tol : float, optional, default 1e-4
Tolerance for the optimization. When the loss or score is not improving by at least tol for two consecutive iterations, unless learning_rate is set to ‘adaptive’, convergence is considered to be reached and training stops.
learning_rate_init : double, optional, default 0.001
The initial learning rate used. It controls the step-size in updating the weights. Only used when solver=’sgd’ or ‘adam’.
power_t : double, optional, default 0.5
The exponent for inverse scaling learning rate. It is used in updating effective learning rate when the learning_rate is set to ‘invscaling’. Only used when solver=’sgd’.
verbose : bool, optional, default False
Whether to print progress messages to stdout.
warm_start : bool, optional, default False
When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.
momentum : float, default 0.9
Momentum for gradient descent update. Should be between 0 and 1. Only used when solver=’sgd’.
nesterovs_momentum : boolean, default True
Whether to use Nesterov’s momentum. Only used when solver=’sgd’ and momentum > 0.
early_stopping : bool, default False
Whether to use early stopping to terminate training when validation score is not improving. If set to true, it will automatically set aside 10% of training data as validation and terminate training when validation score is not improving by at least tol for two consecutive epochs. Only effective when solver=’sgd’ or ‘adam’
validation_fraction : float, optional, default 0.1
The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True
beta_1 : float, optional, default 0.9
Exponential decay rate for estimates of first moment vector in adam, should be in [0, 1). Only used when solver=’adam’
beta_2 : float, optional, default 0.999
Exponential decay rate for estimates of second moment vector in adam, should be in [0, 1). Only used when solver=’adam’
epsilon : float, optional, default 1e-8
Value for numerical stability in adam. Only used when solver=’adam’
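The hidden_layer_sizes rule (tuple length = n_layers - 2) can be checked on the two-sample toy problem used earlier in this section. This is only a sketch: coefs_ and n_layers_ are attributes of the fitted MLPClassifier, and the lbfgs solver may print a convergence warning on such a tiny data set.

```python
from sklearn.neural_network import MLPClassifier

X = [[0.0, 0.0], [1.0, 1.0]]
y = [0, 1]

hidden = (5, 2)  # two hidden layers with 5 and 2 neurons
clf = MLPClassifier(solver="lbfgs", alpha=1e-5,
                    hidden_layer_sizes=hidden, random_state=1)
clf.fit(X, y)

# One weight matrix per pair of adjacent layers:
# input(2) -> hidden(5) -> hidden(2) -> output(1 unit for a binary problem)
layer_shapes = [w.shape for w in clf.coefs_]
print(layer_shapes)   # [(2, 5), (5, 2), (2, 1)]
print(clf.n_layers_)  # 4 = 2 hidden layers + input + output
```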

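VI. The AdaBoost algorithm

Li Hang's book, p. 138; the "watermelon book", p. 177. Only homogeneous ensembles (all decision trees, or all neural networks, and so on) have a "base learner". The weak classifiers depend strongly on one another, so they are trained sequentially.

Boosting has two core questions:
1. How to change the weights or probability distribution of the training data in each round:
"Re-weighting" changes the probability distribution (weight distribution) of the training data by raising the weights of the samples that the previous weak classifier misclassified.
"Re-sampling" is used when the learner cannot accept a weight distribution.
Neither method is clearly better; the latter can additionally avoid stopping too early when a newly built base learner is poor.

2. How to combine the weak classifiers into a strong classifier:
Weighted majority voting. Every classifier takes part, but weak classifiers with a small classification error get a larger weight and therefore contribute more.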
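Both answers can be traced through one boosting round in plain Python. The sketch below follows the standard AdaBoost update: classifier weight alpha = 0.5 * ln((1 - e) / e), and sample weights multiplied by exp(-alpha * y * G(x)) and then renormalized. The ten-point data set and the threshold stump are made up for illustration.

```python
import math

# Made-up 1-D toy set with labels in {-1, +1}.
x = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

def stump(v):
    """Weak classifier: predict +1 below the threshold 2.5, else -1."""
    return 1 if v < 2.5 else -1

weights = [1 / len(x)] * len(x)  # round 1 starts from uniform weights
pred = [stump(v) for v in x]

# Weighted error of the weak classifier (points 6, 7, 8 are misclassified).
error = sum(w for w, p, t in zip(weights, pred, y) if p != t)

# Vote weight of this classifier: a smaller error gives a larger alpha.
alpha = 0.5 * math.log((1 - error) / error)

# Re-weighting: misclassified samples (t * p == -1) are scaled up,
# correctly classified ones are scaled down, then everything is renormalized.
unnormalized = [w * math.exp(-alpha * t * p) for w, p, t in zip(weights, pred, y)]
z = sum(unnormalized)
new_weights = [w / z for w in unnormalized]

print(round(error, 4), round(alpha, 4))
print([round(w, 4) for w in new_weights])
```

After the update, the misclassified samples together carry exactly half of the total weight, which forces the next weak classifier to concentrate on them.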

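VII. xgboost

1. Installation. It needs git and mingw; no Visual Studio build is required.

Reference 1, Reference 2, Reference 3
All three installation guides are good.
Note: if you get the error "mingw32-make command not found", the system environment variable has not been configured yet. You must close git first and then configure it; configuring it while git is still open does not work.
The procedure that finally worked for me: I followed the reference source throughout. Download mingw32-make.exe and rename it to make. (I do not know why, but renaming it during installation made it work; without renaming, it fails with "did you intall compilers and run build.sh".)
However, when I tried the same method on a Win10 i7 desktop (the "540"), it did not work, so I plan to use Linux instead.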

import xgboost as xgb
params = {
    'booster':'gbtree',
    'objective':'multi:softmax',
    'num_class':3,
    'eta':0.5,
    'max_depth':5,
    'subsample':0.6,
    'min_child_weight':5,
    'colsample_bytree':0.8,
    'scale_pos_weight':2,
    'eval_metric':'merror',
    # 'gamma':0.2,
    # 'lambda':300,
    'silent':1
}
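dtrain = xgb.DMatrix(X_train, y_train)  # 80% of the training data
dval = xgb.DMatrix(X_val, y_val)  # the remaining 20% of the training data, i.e. the validation set
evallist  = [(dtrain,'train'), (dval,'eval')]  # early stopping watches the LAST entry, so the validation set goes last
dtest = xgb.DMatrix(X_unlabeled)  # the test set, also called X_unlabeled
num_round = 500
bst = xgb.train( params, dtrain, num_round, evallist, early_stopping_rounds=30, verbose_eval = True )    # train xgboost
predict = bst.predict(dtest)  # final result: usually one predicted label per row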

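2. Pure theory from the official documentation

(1) Boosted trees (the author's point is that xgb is a library built on the theory of boosted trees, with careful thought given to both system optimization and machine-learning principles). (reference source)
(2) The DART booster brings the dropout idea from deep learning into boosting, to avoid the situation where trees added early matter a lot while trees added later matter little. (reference source)

3. Tuning the three kinds of XGBoost parameters (general parameters, booster parameters, task parameters)

(reference source)

A. General parameters (they decide which booster is used for boosting; the common boosters are tree models and linear models)

Dealing with overfitting: first, tune the three parameters max_depth, min_child_weight and gamma; second, add randomness with subsample and colsample_bytree; third, lower eta and raise num_round.

Dealing with imbalanced classes: if you only care about the ranking order of the predictions (AUC), adjust scale_pos_weight. If you care about predicting the right probability, do not rebalance the data; instead set max_delta_step to a finite value (1, say) to help convergence.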

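booster: the booster. # Three choices: gbtree, dart, gblinear. The first two are tree models; the last is a linear model.
silent: no logging. # 0 prints running messages, 1 prints nothing.
nthread: number of parallel threads. # Used to speed things up.
num_pbuffer: size of the prediction buffer, which stores the prediction results of the previous round. # Set automatically by the program; users do not need to touch it.
num_feature: feature dimension. # Set automatically by the program; users do not need to touch it.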

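B. Parameters of the tree booster (Tree Booster)

eta: learning rate.
gamma: minimum loss reduction required to split. # If small, leaves keep being split and the tree becomes more detailed. If large, the algorithm is more conservative.
max_depth: maximum depth of a tree. # The larger the value, the more complex and specific the model, and the more likely it is to overfit.
min_child_weight: minimum sum of instance weights in a child node. # An important parameter. If a split would produce a leaf whose weight sum is below this threshold, splitting stops. The larger the value, the more conservative the model and the less it overfits. It is related to gamma, but gamma looks at the loss while this looks at the node's own weight.
max_delta_step: maximum delta step allowed for each tree's weight estimate. # 0 means no constraint; a positive value makes the update step more conservative. Usually this parameter is not needed, but it can help logistic regression when the classes are extremely imbalanced.
subsample: subsample ratio of the training instances. # Whether to train on only part of the samples, which helps prevent overfitting. Default 1, i.e. all samples are used.
colsample_bytree: fraction of columns (features) sampled for each tree. # Default 1
colsample_bylevel: fraction of columns (features) sampled at each level. # Default 1
lambda: weight of the L2 regularization. # Increasing it makes the model more conservative.
alpha: weight of the L1 regularization. # Increasing it makes the model more conservative.
tree_method: tree construction algorithm. # Default auto, which heuristically picks the fastest one. For small and medium data use exact (the exact greedy algorithm); for large data choose approx (the approximate algorithm).
sketch_eps: # Only used by the approx algorithm; users usually do not need to tune it. A smaller value gives a more accurate set of candidate splits.
scale_pos_weight: for imbalanced classification. # Default 1; a typical value is sum(negative cases) / sum(positive cases)
updater: the tree updaters. # An advanced parameter that defines which updaters are run to build or modify trees. The program normally sets it automatically, though users can set it themselves. There are many options, which I did not look into in detail.
refresh_leaf: # Only relevant when updater=refresh. If False, leaf values are not updated; only the statistics of the internal nodes are.
process_type: boosting process to run. # Two choices. The default, default, creates new trees; update starts from the existing trees and updates them.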
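The typical value quoted for scale_pos_weight is a one-line computation on the training labels. In the sketch below the 950/50 label split and the surrounding params dict are made up for illustration; only the parameter names come from the list above.

```python
import numpy as np

# Made-up imbalanced binary labels: 950 negatives, 50 positives.
y_train = np.array([0] * 950 + [1] * 50)

n_pos = int((y_train == 1).sum())
n_neg = int((y_train == 0).sum())
scale_pos_weight = n_neg / n_pos  # sum(negative cases) / sum(positive cases)

# Illustrative params dict; AUC is used for evaluation because re-weighting
# like this targets ranking quality rather than calibrated probabilities.
params = {
    "objective": "binary:logistic",
    "eta": 0.1,
    "max_depth": 5,
    "scale_pos_weight": scale_pos_weight,
    "eval_metric": "auc",
}
print(scale_pos_weight)  # 19.0
```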

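C. Parameters of the Dart Booster (5 more than the tree booster)

sample_type: sampling method. # Two choices. uniform drops trees with equal probability; weighted drops trees in proportion to their weight.
normalize_type: normalization method. # Two choices, tree and forest. They use different formulas for the weight of the new trees relative to the dropped trees.
rate_drop: dropout rate. # The fraction of previous trees to drop.
one_drop: # When non-zero, at least one tree is always dropped.
skip_drop: probability of skipping the dropout procedure in a boosting round.

D. Parameters of the Linear Booster (only 3)

lambda: weight of the L2 regularization. # Increasing it makes the model more conservative.
alpha: weight of the L1 regularization. # Increasing it makes the model more conservative.
lambda_bias: L2 regularization on the bias term. # I do not know how its magnitude affects the result.

E. Parameters of Tweedie regression (it seems to sit at the same level as the three boosters above)

tweedie_variance_power: # A value close to 2 moves toward a gamma distribution; a value close to 1 moves toward a Poisson distribution.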

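F. Task parameters

objective:
reg:linear: linear regression.
reg:logistic: logistic regression.
binary:logistic: logistic regression for binary classification (outputs a probability).
binary:logitraw: logistic regression for binary classification (outputs the raw score).
count:poisson: Poisson regression for count data (outputs the mean; max_delta_step defaults to 0.7).
multi:softmax: multi-class classification (the number of classes num_class must be set).
multi:softprob: multi-class classification (same as the previous one, except that the output is ndata*nclass, i.e. the probability of each sample belonging to each class).
rank:pairwise: ranking problems.
reg:gamma: gamma regression (returns the mean).
reg:tweedie: Tweedie regression.

base_score: the initial prediction score of all instances. # With enough iterations, changing this value has little effect.

eval_metric: evaluation metric. # Users can add more than one metric. The main ones: rmse for regression, error for classification, map (mean average precision) for ranking.

seed: random seed. # Default 0.

There are also command-line parameters. I do not expect to use them much, so they are not listed in full. (reference source)

use_buffer: whether to create a binary buffer from text input to speed up data loading.
num_round: number of boosting rounds.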

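4. XGBoost methods (API)

The first link is a quick start; the second is the detailed API reference (everything below is excerpted from the second). # Reference 1, Reference 2

Core data structures

(1)

DMatrix(data, label=None, missing=None, weight=None, silent=False, feature_names=None, feature_types=None)# Builds the XGB data matrix.

feature_names: property; the feature names, i.e. the column labels.
feature_types: property; the feature types, i.e. the column types.
get_base_margin: returns the base margin.
get_float_info: takes the name of a field as a str and returns it as an array of floats.
get_label: returns the labels.
get_uint_info: takes the name of a field as a str and returns it as an array of unsigned integers.
get_weight: returns the weights. # One weight per instance (row); they may differ.
num_col: the number of columns, i.e. the number of features.
num_row: the number of rows, i.e. the number of data records.
save_binary: saves the DMatrix to an XGB binary buffer file.
set_base_margin: sets the initial base margin. # A prediction can be supplied as the base margin.
set_float_info: sets a float-typed property on the DMatrix.
set_group: sets the size of each group in the DMatrix.
set_label: sets the labels.
set_uint_info: sets an unsigned-int-typed property on the DMatrix.
set_weight: sets the weights.
slice: selects a subset of rows by index and returns a new DMatrix.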

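(2)

Booster(params=None, cache=(), model_file=None)# The actual XGB model, covering training, prediction and evaluation.

attr: returns the attribute stored under a given key, as a str.
attributes: returns all attributes as a dict.
boost: boosts the booster for one iteration, using custom gradient statistics.
copy: copies the booster.
dump_model: dumps the model to a text file.
eval: evaluates the model; returns the result as a str.
eval_set: evaluates a set of data.
get_dump: returns the dumped model as a list of strings.
get_fscore: returns the importance of each feature.
get_score: same as above, with an extra importance_type parameter that can be weight, gain or cover.
get_split_value_histogram: the split values of one feature as a histogram.
load_model: loads a model.
load_rabit_checkpoint: initializes the model by loading a Rabit checkpoint.
predict(data, output_margin=False, ntree_limit=0, pred_leaf=False): makes predictions. # Not thread-safe. For multi-threaded prediction, first copy the booster with bst.copy() and then call predict. ntree_limit limits how many trees are used for prediction; in particular, this value can be obtained from the train function after it has processed the validation set. output_margin: whether to output the raw, untransformed margin values. pred_leaf: output the index of the leaf that each sample falls into in each tree.
save_model: saves the model to a file.
save_rabit_checkpoint: saves the current Rabit checkpoint.
save_raw: saves the model to an in-memory buffer.
set_attr: sets attributes.
set_param: sets parameters.
update: performs one boosting iteration.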

API functions

(1) Train a model. Returns a trained booster.

xgboost.train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None, evals_result=None, verbose_eval=True, xgb_model=None, callbacks=None, learning_rates=None)

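params (dict): the parameters.
dtrain (DMatrix): the data XGB trains on.
num_boost_round (int): number of boosting iterations.
evals (list of pairs (DMatrix, string)): a list of pairs; the validation sets whose performance is reported during training.
obj (function): custom objective function.
feval (function): custom evaluation function.
maximize (bool): whether feval should be maximized.
early_stopping_rounds (int): stops training when the validation score has not improved for this many rounds. It requires at least one entry in evals; if there are several, the last one is used. # The returned model is the one from the last iteration, not the best one; bst.best_score, bst.best_iteration and bst.best_ntree_limit describe the best round.
evals_result (dict): a dict through which the scores of a particular validation set can be looked up. # For example, with evals = [(dtest,'eval'), (dtrain,'train')] and {'eval_metric': 'logloss'} in params, the scores can be looked up by name: {'train': {'logloss': ['0.48253', '0.35953']}, 'eval': {'logloss': ['0.480385', '0.357756']}}
verbose_eval (bool or int): whether to print the result of every iteration. # True prints every round, False prints nothing, and a positive integer (4, say) prints every 4 rounds.
learning_rates (list or function (deprecated - use callback API instead))# If a list, it gives the learning rate for each round. params also has a learning rate, eta, but eta can only be a single float and is less specific than this.
xgb_model (file name of stored xgb model or 'Booster' instance): an existing XGB model to load before training, given by file name or as a Booster.
callbacks (list of callback functions): functions applied at the end of each iteration, given as a list. # Parameters can be preset this way (I have not used it yet): [xgb.callback.reset_learning_rate(custom_rates)]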

Cross-validation

xgboost.cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, folds=None, metrics=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None, fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True, seed=0, callbacks=None, shuffle=True)

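Other extension methods.

Practical impression: the sklearn wrapper was too slow and never produced a result. Better not to use it!
The scikit-learn wrapper of XGB for regression problems. (No need to pick the various parameters yourself.)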

xgboost.XGBRegressor(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, objective='reg:linear', nthread=-1, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, seed=0, missing=None)

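The scikit-learn wrapper of XGB for classification problems. (No need to work out the various parameters yourself.) evals_result returns the evaluation results; fit trains the model.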

xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, objective='binary:logistic', nthread=-1, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, seed=0, missing=None)

5. An XGBClassifier example

param_dist = {'objective':'binary:logistic', 'n_estimators':2}
clf = xgb.XGBClassifier(**param_dist)
clf.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_test, y_test)], eval_metric='logloss', verbose=True)
evals_result = clf.evals_result()  # evals_result will contain: {'validation_0': {'logloss': ['0.604835', '0.531479']},'validation_1': {'logloss': ['0.41965', '0.17686']}}

Plotting. All three of the following work.

xgboost.plot_importance(booster, ax=None, height=0.2, xlim=None, ylim=None, title='Feature importance', xlabel='F score', ylabel='Features', importance_type='weight', grid=True, **kwargs)

xgboost.plot_tree(booster, fmap='', num_trees=0, rankdir='UT', ax=None, **kwargs)

xgboost.to_graphviz(booster, fmap='', num_trees=0, rankdir='UT', yes_color='#0000FF', no_color='#FF0000', **kwargs)

6. XGBoost code example

This mainly covers how to use its functional API. The reference source includes a Features Walkthrough and other code examples.

import numpy as np
import scipy.sparse
import pickle
import xgboost as xgb
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
# specify validation sets to watch performance
watchlist  = [(dtest,'eval'), (dtrain,'train')]
num_round = 2
bst = xgb.train(param, dtrain, num_round, watchlist)

# this is prediction
preds = bst.predict(dtest)
labels = dtest.get_label()
print ('error=%f' % ( sum(1 for i in range(len(preds)) if int(preds[i]>0.5)!=labels[i]) /float(len(preds))))
bst.save_model('0001.model')
# dump model
bst.dump_model('dump.raw.txt')
# dump model with feature map
bst.dump_model('dump.nice.txt','../data/featmap.txt')

# save dmatrix into binary buffer
dtest.save_binary('dtest.buffer')
# save model
bst.save_model('xgb.model')
# load model and data in
bst2 = xgb.Booster(model_file='xgb.model')
dtest2 = xgb.DMatrix('dtest.buffer')
preds2 = bst2.predict(dtest2)
# assert they are the same
assert np.sum(np.abs(preds2-preds)) == 0

# alternatively, you can pickle the booster
pks = pickle.dumps(bst2)
# load model and data in
bst3 = pickle.loads(pks)
preds3 = bst3.predict(dtest2)
# assert they are the same
assert np.sum(np.abs(preds3-preds)) == 0

# build dmatrix from scipy.sparse
print ('start running example of build DMatrix from scipy.sparse CSR Matrix')
labels = []
row = []; col = []; dat = []
i = 0
for l in open('../data/agaricus.txt.train'):
    arr = l.split()
    labels.append(int(arr[0]))
    for it in arr[1:]:
        k,v = it.split(':')
        row.append(i); col.append(int(k)); dat.append(float(v))
    i += 1
csr = scipy.sparse.csr_matrix((dat, (row,col)))
dtrain = xgb.DMatrix(csr, label = labels)
watchlist  = [(dtest,'eval'), (dtrain,'train')]
bst = xgb.train(param, dtrain, num_round, watchlist)

print ('start running example of build DMatrix from scipy.sparse CSC Matrix')
# we can also construct from csc matrix
csc = scipy.sparse.csc_matrix((dat, (row,col)))
dtrain = xgb.DMatrix(csc, label=labels)
watchlist  = [(dtest,'eval'), (dtrain,'train')]
bst = xgb.train(param, dtrain, num_round, watchlist)

print ('start running example of build DMatrix from numpy array')
# NOTE: npymat is numpy array, we will convert it into scipy.sparse.csr_matrix in internal implementation
# then convert to DMatrix

npymat = csr.todense()
dtrain = xgb.DMatrix(npymat, label = labels)
watchlist  = [(dtest,'eval'), (dtrain,'train')]
bst = xgb.train(param, dtrain, num_round, watchlist)