Ensemble learning and decision tree learning roadmap

Classification

1. Homogeneity and heterogeneity

An ensemble containing only individual learners of the same type is "homogeneous";
an ensemble containing individual learners of different types is "heterogeneous".

2. Bagging & Boosting

Bagging

The original data set is randomly sampled T times to obtain T sub-data sets of the same size as the original data set; T weak classifiers are trained on them separately and then combined into a strong classifier.

An overfitted model usually has high variance; bagging should be used to correct it.

The models in bagging are strong models with low bias and high variance. The goal is to reduce variance. In bagging, the bias and variance of each model are approximately the same, but the models are not highly correlated with each other, so bagging generally cannot reduce bias, but it can reduce variance to a certain extent. A typical bagging method is the random forest (RF).
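A minimal sketch of this point (assuming scikit-learn is available; the synthetic dataset and hyperparameters are only illustrative): a random forest bags many deep, high-variance trees and is typically more stable than a single tree.

```python
# Compare a single deep decision tree (low bias, high variance) with a bagged
# ensemble of such trees (random forest). Dataset and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("single tree CV accuracy:", cross_val_score(single_tree, X, y, cv=5).mean())
print("random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```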

Basic process

A) Extract training sets from the original sample set. Each round uses bootstrapping (random sampling with replacement) to draw n training samples from the original set of N samples (some samples may be drawn multiple times, while others may not be drawn at all). A total of k rounds are performed, yielding k training sets that are independent of each other.
B) Each training set is used to train one model, so k models are obtained from the k training sets. (Note: different classification or regression methods can be used depending on the problem, such as decision trees or neural networks.)
C) For classification problems, the k models vote to produce the final classification; for regression problems, the mean of the k model outputs is taken as the final result (see the sketch after this list).
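A from-scratch sketch of steps A-C for classification, using numpy and scikit-learn decision trees as the base learners. The function names `bagging_fit` and `bagging_predict` are illustrative, not a standard API; X and y are assumed to be numpy arrays with integer class labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, random_state=0):
    """Steps A + B: draw k bootstrap samples and train one model per sample."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # n draws with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Step C (classification): majority vote over the k models."""
    preds = np.stack([m.predict(X) for m in models])  # shape (k, n_samples)
    # for each sample, take the most frequent predicted label
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
```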

Samples are drawn uniformly with replacement, all classifiers have equal weight, and the classifiers can be trained in parallel.

Boosting

Boosting is a weighted ensemble of weak classifiers; through continual iterative updates, the final result can be driven arbitrarily close to the optimal classification.

An underfitted model usually has high bias; boosting can be used to correct it. When using boosting, each individual model can be kept relatively simple.

Each model in boosting is a weak model with high bias and low variance. The goal is to reduce the bias. The basic idea of boosting is to greedily minimize the loss function, which clearly reduces bias, but the resulting models are usually highly correlated, so the variance cannot be reduced significantly. The typical boosting algorithm is AdaBoost, and commonly used gradient-based boosting algorithms are GBDT (gradient boosted decision trees) and XGBoost. This type of algorithm is usually not prone to overfitting.
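A minimal sketch of this contrast (assuming scikit-learn; hyperparameters are illustrative): boosting combines many shallow, high-bias trees, whereas the bagging example above combined deep trees.

```python
# AdaBoost and gradient boosting over shallow trees on a synthetic dataset.
# AdaBoostClassifier's default base learner is a depth-1 decision stump.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

ada = AdaBoostClassifier(n_estimators=200, random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0)

print("AdaBoost CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
print("GBDT CV accuracy:", cross_val_score(gbdt, X, y, cv=5).mean())
```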

Basic process

  1. e is the misclassification rate of a weak classifier; it is used to compute the credibility weight α of that classifier and to update the sampling weights D.
  2. D is the weight vector over the original data used for sampling. Initially, every sample has the same sampling probability, 1/m. After a weak classifier is trained, D is increased for misclassified samples and decreased for correctly classified ones according to e. Misclassified samples therefore become more likely to be drawn in the next round, which raises the chance that previously misclassified samples are classified correctly next time.
  3. α is the credibility of a weak classifier; in bagging, the implicit α of every model is 1. In boosting, α is determined by the performance of each weak classifier (a lower e gives a higher α), which sets the weight of that classifier's output in the final result: classifiers that classify accurately naturally carry more weight.
  4. Finally, the outputs h(x) of the weak classifiers are combined, weighted by their credibilities α, to produce the final result (as sketched below).
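A from-scratch sketch of the weight updates in steps 1-4, in the style of AdaBoost. It assumes binary labels in {-1, +1} and uses a depth-1 scikit-learn tree (decision stump) trained with per-sample weights; the function names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    m = len(X)
    D = np.full(m, 1.0 / m)   # step 2: initial sample weights, 1/m each
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = stump.predict(X)
        e = np.clip(np.sum(D[pred != y]), 1e-10, 1 - 1e-10)  # step 1: weighted error e
        alpha = 0.5 * np.log((1 - e) / e)                     # step 3: credibility α
        D *= np.exp(-alpha * y * pred)   # step 2: raise weights of misclassified samples
        D /= D.sum()                     # renormalize to a distribution
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # step 4: combine the weak classifiers' outputs h(x), weighted by their α
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)
```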

There are mainly two parts: updating the sampling weights D and computing the classifier weights α. The former makes previously misclassified samples more likely to appear for the next classifier, thereby increasing the probability that they are classified correctly; the latter assigns different weights to different weak classifiers according to their performance, finally producing a weighted strong classifier.

Samples are drawn with weights based on the error rate, classifiers are weighted by their accuracy, and the classifiers can only be generated sequentially (serially).

The boosting algorithm keeps trying to classify the previously misclassified samples correctly. If the data contains mislabeled or anomalous samples, boosting will not work well; it is sensitive to outliers and noise.

