Machine Learning Notes 3: AdaBoost

In general, ensemble methods are suitable for models that tend to overfit; they include bagging and boosting.

3.1 Bagging

In bagging, each classifier is trained independently and the individual results are combined by averaging or voting. Boosting, by contrast, has a strong dependence between classifiers: each new classifier is influenced by the predictions of the classifiers trained before it. Random forest is bagging applied to decision trees.
At the same depth, a random forest is not much better than a single tree, but it makes the classification results smoother.
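As a quick illustration (not from the original notes), here is a minimal scikit-learn sketch comparing a single decision tree with a random forest of the same depth; the dataset and hyperparameters are arbitrary choices for demonstration.

```python
# Minimal sketch: single decision tree vs. random forest (bagging of trees)
# at the same depth. Data and hyperparameters are illustrative only.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, max_depth=5,
                                random_state=0).fit(X_tr, y_tr)

print("single tree accuracy:", tree.score(X_te, y_te))
print("random forest accuracy:", forest.score(X_te, y_te))
```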

3.2 Boosting

The aim of boosting is to improve the performance of weak classifiers through an iterative method (turning weak classifiers into a strong one).

The boosting architecture is as follows:
First train a classifier \(f_1(x)\); then train a second classifier \(f_2(x)\) to help \(f_1(x)\). If \(f_2(x)\) is similar to \(f_1(x)\), its help is limited, so we want \(f_2(x)\) to complement \(f_1(x)\). Bagging trains each classifier on a different dataset obtained by resampling the original dataset, whereas boosting obtains different datasets by multiplying each sample of the original dataset by a weight \(u^{(i)}\), which gives the total loss function

\[L(f) = \sum_{i=1}^{m} u^{(i)} l(f(x^{(i)}),\hat{y}^{(i)}) \]

where \(l(f(x^{(i)}), \hat{y}^{(i)})\) is any loss function that measures the discrepancy between the predicted value and the true value.
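As a small illustrative sketch (the function names and the 0/1 loss below are my own choices, not part of the original notes), the weighted total loss can be computed like this in Python:

```python
import numpy as np

def weighted_loss(u, y_pred, y_true, loss):
    # L(f) = sum_i u^{(i)} * l(f(x^{(i)}), y^{(i)}) for an arbitrary
    # per-sample loss l; u, y_pred, y_true are equal-length 1-D arrays
    return np.sum(u * loss(y_pred, y_true))

# example per-sample loss: 0/1 loss for labels in {-1, +1}
zero_one = lambda y_pred, y_true: (y_pred != y_true).astype(float)
```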
The idea of AdaBoost is as follows. Suppose there are k weak classifiers \(f_1, f_2, ..., f_k\). First train \(f_1\) on the weighted dataset \(\{x^{(i)}, \hat{y}^{(i)}, u^{(i)}_1\}\); then change the weight of each training sample to \(u^{(i)}_2\). On the new weighted dataset \(\{x^{(i)}, \hat{y}^{(i)}, u^{(i)}_2\}\) the performance of \(f_1\) becomes worse; now train \(f_2\) so that it performs well on this new dataset. "Performance" here simply means the accuracy or error rate, and the error rate is computed with the following formula:

\[\epsilon_1 = \frac{ \sum_{i=1}^{m}u^{(i)}_1\delta(f_1(x^{(i)})\not=\hat{y}^{(i)})}{Z_1}\]

where \(\delta(f_1(x^{(i)}) \not= \hat{y}^{(i)})\) is the indicator function, equal to 1 when the prediction and the true label differ and 0 otherwise, and \(Z_1\) is the normalizer of the weights:
\[Z_1 = \sum_{i=1}^{m} u^{(i)}_1\]
Normalization is needed because the weights of the training samples are not necessarily 1. Why does this formula represent the error rate? Consider an example: suppose every weight is 1; then \(Z_1 = m\) (there are \(m\) samples), so the denominator is \(m\). If classifier \(f_1\) misclassifies \(T\) samples, the numerator of \(\epsilon_1\) is \(T\), and \(\frac{T}{m}\) is of course the error rate. It can also be understood from a probabilistic viewpoint: \(\frac{u^{(i)}_1}{Z_1}\) can be read as the probability mass of each sample, so the weighted average of the misclassifications is the error rate of the first classifier (this interpretation is a little rough, but it helps intuition).
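A minimal Python sketch of this error-rate formula, with a tiny made-up example (all weights 1, one mistake out of four samples, so the result should be 0.25):

```python
import numpy as np

def weighted_error_rate(u, y_pred, y_true):
    # epsilon = sum_i u_i * [y_pred_i != y_true_i] / Z, with Z = sum_i u_i
    Z = np.sum(u)
    return np.sum(u * (y_pred != y_true)) / Z

u = np.ones(4)
y_true = np.array([1, 1, -1, -1])
y_pred = np.array([1, -1, -1, -1])             # one mistake out of four
print(weighted_error_rate(u, y_pred, y_true))  # 0.25
```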
Note that for binary classification the error rate satisfies \(\epsilon_1 < 0.5\), and for K-class classification \(\epsilon_1 < 1/K\); below only the binary case is discussed. Next, the weights of the dataset are changed from \(u^{(i)}_1\) to \(u^{(i)}_2\) such that
\[\frac{\sum_{i=1}^{m} u^{(i)}_2 \delta(f_1(x^{(i)}) \not= \hat{y}^{(i)})}{Z_2} = 0.5\]
This target is chosen because, for binary classification, even random guessing achieves 50% accuracy at worst; we want to make \(f_1\) perform as badly as possible, i.e. raise its error rate on the reweighted training data \(\{x^{(i)}, \hat{y}^{(i)}, u^{(i)}_2\}\) to 0.5.

How do we raise the error rate of \(f_1\), i.e. make its classification worse? A very simple way: for the data that \(f_1\) classifies correctly, give them smaller weights; for the data it misclassifies, give them larger weights. An example:
Suppose the initial weights are all 1 and \(f_1\) classifies the 1st, 3rd, and 4th samples correctly, so its error rate is 0.25. For the correctly classified samples, lower their weights to \(1/\sqrt3\); for the misclassified 2nd sample, raise its weight to \(\sqrt3\). The error rate of \(f_1\) then rises to 0.5. The rule can be summarized as follows:

  • If \(f_1\) classifies a data point correctly, i.e. \(f_1(x^{(i)}) = \hat{y}^{(i)}\), then the new training weight \(u^{(i)}_2\) is reduced to the original weight divided by \(d_1\): \(u^{(i)}_2 = u^{(i)}_1 / d_1\)
  • If \(f_1\) misclassifies a data point, i.e. \(f_1(x^{(i)}) \not= \hat{y}^{(i)}\), then the new training weight \(u^{(i)}_2\) is increased to the original weight multiplied by \(d_1\): \(u^{(i)}_2 = u^{(i)}_1 d_1\)
    How do we find \(d_1\)? Start from the requirement
    \[ \frac{ \sum_{i=1}^{m}u^{(i)}_2\delta(f_1(x^{(i)})\not=\hat{y}^{(i)})}{Z_2}=0.5\]
    Simply expand \(u^{(i)}_2\) in the equation above: for the misclassified samples it becomes \(u^{(i)}_1 d_1\), and for the correctly classified samples it becomes \(u^{(i)}_1/d_1\); then solve the equation. After some simplification we get:
    \[\sum_{f_1(x^{(i)})=\hat{y}^{(i)}} u^{(i)}_1/d_1=\sum_{f_1(x^{(i)})\not=\hat{y}^{(i)}} u^{(i)}_1d_1\]

That is, the sum of the weights of the misclassified data must equal the sum of the weights of the correctly classified data. Further derivation gives \(d_1\) as:
\[d_1 = \sqrt{\frac{1-\epsilon_1}{\epsilon_1}}>1\]
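A quick numeric check of the example above (4 samples with initial weight 1, sample 2 misclassified), written as a small Python sketch assuming NumPy; it confirms \(d_1 = \sqrt3\) and that the reweighted error rate becomes 0.5:

```python
import numpy as np

u1 = np.ones(4)
wrong = np.array([False, True, False, False])   # which samples f_1 got wrong

eps1 = np.sum(u1 * wrong) / np.sum(u1)          # 0.25
d1 = np.sqrt((1 - eps1) / eps1)                 # sqrt(3)

u2 = np.where(wrong, u1 * d1, u1 / d1)          # multiply wrong, divide right
print(d1)                                       # 1.732...
print(np.sum(u2 * wrong) / np.sum(u2))          # 0.5, as required
```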
From the above we can see that, when updating the data weights, we multiply by a number for the data misclassified in the previous round (to increase the weight) and divide by a number for the data classified correctly in the previous round (to decrease the weight). Is there a way to express both operations as multiplication? Simply take the logarithm of \(d_1\): the logarithm preserves the original monotonicity, and in addition it turns multiplication and division into addition and subtraction. Taking the logarithm of \(d_1\) gives:
\[a_1 = \log d_1 = \log (\sqrt{\frac{1-\epsilon_1}{\epsilon_1}})= \frac{1}{2}\log \frac{1-\epsilon_1}{\epsilon_1}\]
In this way both weight updates can be expressed as multiplications:

  • If \(f_1\) classifies a data point correctly, its weight is decreased to \(u^{(i)}_2=u^{(i)}_1 e^{-a_1}\)
  • If \(f_1\) misclassifies a data point, its weight is increased to \(u^{(i)}_2=u^{(i)}_1 e^{a_1}\)
    The update of \(u^{(i)}_2\) can be written in a single formula:
    \[u^{(i)}_2 = u^{(i)}_1 \times \exp(-\hat{y}^{(i)}f_1(x^{(i)})a_1)\]

When the prediction equals the true value, \(-\hat{y}^{(i)}f_1(x^{(i)})=-1\); otherwise it equals 1.
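A one-line sketch of this unified update in Python (assuming labels and predictions are encoded in \(\{-1,+1\}\); function and argument names are my own):

```python
import numpy as np

def update_weights(u, y_true, y_pred, a):
    # u_2^{(i)} = u_1^{(i)} * exp(-y^{(i)} * f_1(x^{(i)}) * a_1):
    # correctly classified samples are multiplied by e^{-a},
    # misclassified ones by e^{a}, matching the two bullet points above
    return u * np.exp(-y_true * y_pred * a)
```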

In summary, the binary AdaBoost algorithm can be stated as follows:
Input: \(\{(x^{(1)},\hat{y}^{(1)},u^{(1)}_1),(x^{(2)},\hat{y}^{(2)},u^{(2)}_1),...,(x^{(m)},\hat{y}^{(m)},u^{(m)}_1)\}\)
where \(\hat{y}^{(i)}=\pm 1\) and the initial weights are \(u^{(i)}_1=1\).

  1. For each weak classifier \(t = 1, 2, ..., T\):
    a. Train weak classifier \(f_t\) on the data weighted by \(\{u^{(1)}_t, u^{(2)}_t, ..., u^{(m)}_t\}\);
    b. compute the classification error rate \(\epsilon_t\) of the t-th weak classifier (formula above);
    c. compute its importance \(a_t\) from the error rate;
    d. update the data weights: \(u^{(i)}_{t+1} = u^{(i)}_t \times \exp(-\hat{y}^{(i)} f_t(x^{(i)}) a_t)\).
  2. Training produces a series of weak classifiers \(f_1, f_2, ..., f_T\); the final strong classifier is
    \[H(x) = \text{sign}\left(\sum_{t=1}^{T} a_t f_t(x)\right)\]
    where \(a_t\) is the importance computed from the error rate above.

An intuitive interpretation of this combination strategy: classifiers with lower error rates receive higher weights, while those with higher error rates receive lower weights.
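Below is a minimal from-scratch sketch of the binary AdaBoost procedure described above, using depth-1 decision trees (stumps) from scikit-learn as the weak classifiers; the variable names mirror the notation in these notes, and the choice of stumps as weak learners is my own assumption, not something fixed by the algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=20):
    """y must be in {-1, +1}. Returns the weak classifiers f_t and weights a_t."""
    m = len(y)
    u = np.ones(m)                                  # initial weights u^{(i)}_1 = 1
    stumps, alphas = [], []
    for t in range(T):
        f_t = DecisionTreeClassifier(max_depth=1)
        f_t.fit(X, y, sample_weight=u)              # step a: weighted training
        pred = f_t.predict(X)
        eps = np.sum(u * (pred != y)) / np.sum(u)   # step b: error rate
        eps = np.clip(eps, 1e-10, 1 - 1e-10)        # avoid division by zero
        a_t = 0.5 * np.log((1 - eps) / eps)         # step c: importance
        u = u * np.exp(-y * pred * a_t)             # step d: weight update
        stumps.append(f_t)
        alphas.append(a_t)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """H(x) = sign( sum_t a_t * f_t(x) )."""
    score = sum(a * f.predict(X) for f, a in zip(stumps, alphas))
    return np.sign(score)
```

Note that scikit-learn also ships its own AdaBoostClassifier; the sketch above is only meant to mirror steps a through d of the summary.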

This concludes the notes on AdaBoost.

Origin www.cnblogs.com/cuiyirui/p/11920681.html