Machine Learning (7) Random Forest, GBDT, Adaboost

Bagging

Bagging strategy:

(1) Resampling (with repetition) from the sample set to select n samples;

(2) On all attributes , establish a classifier (ID3, C4.5, CART, SVM, Logistic regression, etc.) for these n samples;

(3) Repeat the above two steps m times to obtain m classifiers;

(4) Put the data on the m classifiers, and finally decide which category the data belongs to according to the voting results of the m classifiers.

Question 1: How to choose the value of n?

Question 2: How to choose the value of m? - Select an odd number of classifiers.

Note: Instead of understanding Bagging as an algorithm, it is better to understand it as an idea, that is, the idea of ​​synthesizing the results of multiple weak classifiers to obtain a strong classifier!

random forest

Random forest is modified on the basis of bagging. The basic idea is:
(1) Select n samples (resampling) from the sample set with Bootstrap sampling (sampling with replacement);
(2) Randomly select k attributes from all attributes , and select the best segmentation attribute as the node Establish a CART decision tree;
(3) Repeat the above two steps m times, that is, establish m CART (binary tree) decision trees
(4) These m CARTs form a random forest, and determine which category the data belongs to by voting on the results.

The relationship between random forests, bagging and decision trees

The current understanding of Bagging and random forest is different as shown in the red font ; Bagging method selects all feature attributes, and random forest selects k feature attributes (a subset of feature attributes) of all feature attributes.

Of course, decision trees can be used as the basic classifier, but other classifiers such as SVM and Logistic regression can also be used. Traditionally, the "total classifier" composed of these classifiers is still called random forest.

See above from: http://blog.csdn.net/american199062/article/details/51314968

Random Forest in Detail

A decision tree is like a master, classifying new data based on what it has learned in the data set. But as the saying goes, one Zhuge Liang can't play three stooges. Random forest is an algorithm that hopes to build multiple stooges, hoping that the final classification effect can exceed that of a single master.
How is the random forest constructed? There are two aspects: random selection of data, and random selection of features to be selected.

Random selection of data:

First, sampling with replacement is taken from the original data set, and a sub-data set is constructed. The data volume of the sub-data set is the same as that of the original data set. Elements in different subdatasets can be repeated, and elements in the same subdataset can also be repeated. Second, use the sub-dataset to build a sub-decision tree. Finally, if there is new data and the classification result needs to be obtained through the random forest, the output result of the random forest can be obtained by voting on the judgment result of the sub-decision tree. As shown in the figure below, assuming that there are 3 sub-decision trees in the random forest, the classification result of 2 sub-trees is class A, and the classification result of 1 sub-tree is class B, then the classification result of random forest is class A.

3.2.2. Random selection of features to be selected:
Similar to the random selection of the data set, each splitting process of the subtree in the random forest does not use all the features to be selected, but randomly selects certain features from all the features to be selected, and then randomly selects certain features. Select the best feature from among the features. This can make the decision trees in the random forest different from each other, improve the diversity of the system, and thus improve the classification performance.
In the figure below, the blue squares represent all the features that can be selected, that is, the current candidate features. The yellow squares are split features. On the left is the feature selection process of a decision tree. By selecting the optimal splitting feature from the features to be selected (don't forget the ID3 algorithm, C4.5 algorithm, CART algorithm, etc. mentioned above), the splitting is completed. On the right is the feature selection process for a subtree in a random forest.

1. There is a question, how many subtrees are generated? (i.e. how many subsets of data to sample?) Generally an odd number is positive, otherwise voting is a problem.
2. How many features are selected for training of each subtree?

******

提升(boosting)是一种机器学习技术,可以用于回归和分类的问题,它每一步产生弱预测模 型(如决策树),并加权累加到总模型中;如果每一步的弱预测模型的生成都是依 据损失函数的梯度方式的,那么就称为梯度提升(Gradient boosting) 提升技术的意义:如果一个问题存在弱预测模型,那么可以通过提升技术的办法 得到一个强预测模型.

梯度提升算法






1 Adaboost的原理

1.1 Adaboost是什么    

    AdaBoost,是英文"Adaptive Boosting"(自适应增强)的缩写,由Yoav Freund和Robert Schapire在1995年提出。它的自适应在于:前一个基本分类器分错的样本会得到加强,加权后的全体样本再次被用来训练下一个基本分类器。同时,在每一轮中加入一个新的弱分类器,直到达到某个预定的足够小的错误率或达到预先指定的最大迭代次数。

    具体说来,整个Adaboost 迭代算法就3步:

  1. 初始化训练数据的权值分布。如果有N个样本,则每一个训练样本最开始时都被赋予相同的权值:1/N。
  2. 训练弱分类器。具体训练过程中,如果某个样本点已经被准确地分类,那么在构造下一个训练集中,它的权值就被降低;相反,如果某个样本点没有被准确地分类,那么它的权值就得到提高。然后,权值更新过的样本集被用于训练下一个分类器,整个训练过程如此迭代地进行下去。
  3. 将各个训练得到的弱分类器组合成强分类器。各个弱分类器的训练过程结束后,加大分类误差率小的弱分类器的权重,使其在最终的分类函数中起着较大的决定作用,而降低分类误差率大的弱分类器的权重,使其在最终的分类函数中起着较小的决定作用。换言之,误差率低的弱分类器在最终分类器中占的权重较大,否则较小。

1.2 Adaboost算法流程

    给定一个训练数据集T={(x1,y1), (x2,y2)…(xN,yN)},其中实例x \in \mathcal{X},而实例空间\mathcal{X} \subset \mathbb{R}^n,yi属于标记集合{-1,+1},Adaboost的目的就是从训练数据中学习一系列弱分类器或基本分类器,然后将这些弱分类器组合成一个强分类器。

    Adaboost的算法流程如下:

  • 步骤1. 首先,初始化训练数据的权值分布。每一个训练样本最开始时都被赋予相同的权值:1/N。

  • 步骤2. 进行多轮迭代,用m = 1,2, ..., M表示迭代的第多少轮

a. 使用具有权值分布Dm的训练数据集学习,得到基本分类器(选取让误差率最低的阈值来设计基本分类器):

b . 计算Gm(x)在训练数据集上的分类误差率
由上述式子可知,Gm(x)在训练数据集上的 误差率em就是被Gm(x)误分类样本的权值之和
c. 计算Gm(x)的系数,am表示Gm(x)在最终分类器中的重要程度(目的:得到基本分类器在最终分类器中所占的权重):
由上述式子可知,em <= 1/2时,am >= 0,且am随着em的减小而增大,意味着分类误差率越小的基本分类器在最终分类器中的作用越大。
d . 更新训练数据集的权值分布(目的:得到样本的新的权值分布),用于下一轮迭代

使得被基本分类器Gm(x)误分类样本的权值增大,而被正确分类样本的权值减小。就这样,通过这样的方式,AdaBoost方法能“重点关注”或“聚焦于”那些较难分的样本上。

    其中,Zm是规范化因子,使得Dm+1成为一个概率分布:

  • 步骤3. 组合各个弱分类器

从而得到最终分类器,如下:

1.3 Adaboost的一个例子

    下面,给定下列训练样本,请用AdaBoost算法学习一个强分类器。

    

    求解过程:初始化训练数据的权值分布,令每个权值W1i = 1/N = 0.1,其中,N = 10,i = 1,2, ..., 10,然后分别对于m = 1,2,3, ...等值进行迭代。

    拿到这10个数据的训练样本后,根据 X 和 Y 的对应关系,要把这10个数据分为两类,一类是“1”,一类是“-1”,根据数据的特点发现:“0 1 2”这3个数据对应的类是“1”,“3 4 5”这3个数据对应的类是“-1”,“6 7 8”这3个数据对应的类是“1”,9是比较孤独的,对应类“-1”。抛开孤独的9不讲,“0 1 2”、“3 4 5”、“6 7 8”这是3类不同的数据,分别对应的类是1、-1、1,直观上推测可知,可以找到对应的数据分界点,比如2.5、5.5、8.5 将那几类数据分成两类。当然,这只是主观臆测,下面实际计算下这个具体过程。

迭代过程1

对于m=1,在权值分布为D1(10个数据,每个数据的权值皆初始化为0.1)的训练数据上,经过计算可得:

    1. 阈值v取2.5时误差率为0.3(x < 2.5时取1,x > 2.5时取-1,则6 7 8分错,误差率为0.3),
    2. 阈值v取5.5时误差率最低为0.4(x < 5.5时取1,x > 5.5时取-1,则3 4 5 6 7 8皆分错,误差率0.6大于0.5,不可取。故令x > 5.5时取1,x < 5.5时取-1,则0 1 2 9分错,误差率为0.4),
    3. 阈值v取8.5时误差率为0.3(x < 8.5时取1,x > 8.5时取-1,则3 4 5分错,误差率为0.3)。

可以看到,无论阈值v取2.5,还是8.5,总得分错3个样本,故可任取其中任意一个如2.5,弄成第一个基本分类器为:

上面说阈值v取2.5时则6 7 8分错,所以误差率为0.3,更加详细的解释是:因为样本集中

    1. 0 1 2对应的类(Y)是1,因它们本身都小于2.5,所以被G1(x)分在了相应的类“1”中,分对了。
    2. 3 4 5本身对应的类(Y)是-1,因它们本身都大于2.5,所以被G1(x)分在了相应的类“-1”中,分对了。
    3. 但6 7 8本身对应类(Y)是1,却因它们本身大于2.5而被G1(x)分在了类"-1"中,所以这3个样本被分错了。
    4. 9本身对应的类(Y)是-1,因它本身大于2.5,所以被G1(x)分在了相应的类“-1”中,分对了。

从而得到G1(x)在训练数据集上的误差率(被G1(x)误分类样本“6 7 8”的权值之和e1=P(G1(xi)≠yi) = 3*0.1 = 0.3

然后根据误差率e1计算G1的系数:

这个a1代表G1(x)在最终的分类函数中所占的权重,为0.4236。
接着 更新训练数据的权值分布,用于下一轮迭代:

值得一提的是,由权值更新的公式可知,每个样本的新权值是变大还是变小,取决于它是被分错还是被分正确。

即如果某个样本被分错了,则yi * Gm(xi)为负,负负得正,结果使得整个式子变大(样本权值变大),否则变小。

第一轮迭代后,最后得到各个数据的权值分布D2 = (0.0715, 0.0715, 0.0715, 0.0715, 0.0715,  0.0715, 0.1666, 0.1666, 0.1666, 0.0715)。由此可以看出,因为样本中是数据“6 7 8”被G1(x)分错了,所以它们的权值由之前的0.1增大到0.1666,反之,其它数据皆被分正确,所以它们的权值皆由之前的0.1减小到0.0715。

分类函数f1(x)= a1*G1(x) = 0.4236G1(x)。

此时,得到的第一个基本分类器sign(f1(x))在训练数据集上有3个误分类点(即6 7 8)。

    从上述第一轮的整个迭代过程可以看出:被误分类样本的权值之和影响误差率,误差率影响基本分类器在最终分类器中所占的权重

  迭代过程2

对于m=2,在权值分布为D2 = (0.0715, 0.0715, 0.0715, 0.0715, 0.0715,  0.0715, 0.1666, 0.1666, 0.1666, 0.0715)的训练数据上,经过计算可得:

    1. 阈值v取2.5时误差率为0.1666*3(x < 2.5时取1,x > 2.5时取-1,则6 7 8分错,误差率为0.1666*3),
    2. 阈值v取5.5时误差率最低为0.0715*4(x > 5.5时取1,x < 5.5时取-1,则0 1 2 9分错,误差率为0.0715*3 + 0.0715),
    3. 阈值v取8.5时误差率为0.0715*3(x < 8.5时取1,x > 8.5时取-1,则3 4 5分错,误差率为0.0715*3)。

所以,阈值v取8.5时误差率最低,故第二个基本分类器为:

面对的还是下述样本:

很明显,G2(x)把样本“3 4 5”分错了,根据D2可知它们的权值为0.0715, 0.0715,  0.0715,所以G2(x)在训练数据集上的误差率e2=P(G2(xi)≠yi) = 0.0715 * 3 = 0.2143。

计算G2的系数:

更新训练数据的权值分布:
D3  = (0.0455, 0.0455, 0.0455,  0.1667, 0.1667,  0.01667 , 0.1060, 0.1060, 0.1060, 0.0455)。被分错的样本“ 3 4 5 ”的权值变大,其它被分对的样本的权值变小。
f2(x)=0.4236G1(x) + 0.6496G2(x)

此时,得到的第二个基本分类器sign(f2(x))在训练数据集上有3个误分类点(即3 4 5)。

  迭代过程3

对于m=3,在权值分布为D3 = (0.0455, 0.0455, 0.0455, 0.1667, 0.1667,  0.01667, 0.1060, 0.1060, 0.1060, 0.0455)的训练数据上,经过计算可得:

    1. 阈值v取2.5时误差率为0.1060*3(x < 2.5时取1,x > 2.5时取-1,则6 7 8分错,误差率为0.1060*3),
    2. 阈值v取5.5时误差率最低为0.0455*4(x > 5.5时取1,x < 5.5时取-1,则0 1 2 9分错,误差率为0.0455*3 + 0.0715),
    3. 阈值v取8.5时误差率为0.1667*3(x < 8.5时取1,x > 8.5时取-1,则3 4 5分错,误差率为0.1667*3)。

所以阈值v取5.5时误差率最低,故第三个基本分类器为:

依然还是原样本:

此时,被误分类的样本是:0 1 2 9,这4个样本所对应的权值皆为0.0455,

所以G3(x)在训练数据集上的误差率e3 = P(G3(xi)≠yi) = 0.0455*4 = 0.1820。

计算G3的系数:

更新训练数据的权值分布:

D4 = (0.125, 0.125, 0.125, 0.102, 0.102,  0.102, 0.065, 0.065, 0.065, 0.125)。被分错的样本“0 1 2 9”的权值变大,其它被分对的样本的权值变小。

f3(x)=0.4236G1(x) + 0.6496G2(x)+0.7514G3(x)

此时,得到的第三个基本分类器sign(f3(x))在训练数据集上有0个误分类点。至此,整个训练过程结束。

    现在,咱们来总结下3轮迭代下来,各个样本权值和误差率的变化,如下所示(其中,样本权值D中加了下划线的表示在上一轮中被分错的样本的新权值):

  1. 训练之前,各个样本的权值被初始化为D1 = (0.1, 0.1,0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1);
  2. 第一轮迭代中,样本“6 7 8”被分错,对应的误差率为e1=P(G1(xi)≠yi) = 3*0.1 = 0.3,此第一个基本分类器在最终的分类器中所占的权重为a1 = 0.4236。第一轮迭代过后,样本新的权值为D2 = (0.0715, 0.0715, 0.0715, 0.0715, 0.0715,  0.0715, 0.1666, 0.1666, 0.1666, 0.0715);
  3. 第二轮迭代中,样本“3 4 5”被分错,对应的误差率为e2=P(G2(xi)≠yi) = 0.0715 * 3 = 0.2143,此第二个基本分类器在最终的分类器中所占的权重为a2 = 0.6496。第二轮迭代过后,样本新的权值为D3 = (0.0455, 0.0455, 0.0455, 0.1667, 0.1667,  0.01667, 0.1060, 0.1060, 0.1060, 0.0455);
  4. 第三轮迭代中,样本“0 1 2 9”被分错,对应的误差率为e3 = P(G3(xi)≠yi) = 0.0455*4 = 0.1820,此第三个基本分类器在最终的分类器中所占的权重为a3 = 0.7514。第三轮迭代过后,样本新的权值为D4 = (0.125, 0.125, 0.125, 0.102, 0.102,  0.102, 0.065, 0.065, 0.065, 0.125)。

    从上述过程中可以发现,如果某些个样本被分错,它们在下一轮迭代中的权值将被增大,同时,其它被分对的样本在下一轮迭代中的权值将被减小。就这样,分错样本权值增大,分对样本权值变小,而在下一轮迭代中,总是选取让误差率最低的阈值来设计基本分类器,所以误差率e(所有被Gm(x)误分类样本的权值之和)不断降低。

    综上,将上面计算得到的a1、a2、a3各值代入G(x)中,G(x) = sign[f3(x)] = sign[ a1 * G1(x) + a2 * G2(x) + a3 * G3(x) ],得到最终的分类器为:

G(x) = sign[f3(x)] = sign[ 0.4236G1(x) + 0.6496G2(x)+0.7514G3(x) ]。


参考:https://blog.csdn.net/sangyongjia/article/details/53325660


Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=325721020&siteId=291194637