Ensemble algorithms (bagging, random forests, boosting)

Definition of ensemble learning

Ensemble learning combines multiple individual learners so that together they complete a learning task, with the goal of improving prediction accuracy. It is also known as a "multi-classifier system".

Example: when a single student's accuracy on an exercise is not high, having several students check the answers improves the accuracy of the result.

Ensemble methods fall into two categories:

Bagging: bagging, random forests
Boosting: AdaBoost, GBDT, XGBoost

General procedure of ensemble learning

  • Let D denote the original training data, k the number of base classifiers (base learners), and Z the test data set.
  • for i = 1 to k do: build a training set Di from D and train a base classifier Ci on Di
  • end for
  • for each test sample x in Z do:
    C*(x) = Vote(C1(x), C2(x), ..., Ck(x)) end for
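A minimal Python sketch of this generic procedure (the bootstrap sampling, the DecisionTreeClassifier base learner, and the synthetic data from make_classification are assumptions added for illustration, not part of the original text):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# D: original training data, Z: test data (synthetic, purely for illustration)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
D_X, D_y, Z_X, Z_y = X[:500], y[:500], X[500:], y[500:]

k = 7                                   # number of base classifiers
rng = np.random.default_rng(0)
classifiers = []
for i in range(k):
    # build D_i by bootstrap sampling from D, then train C_i on D_i
    idx = rng.integers(0, len(D_X), len(D_X))
    C_i = DecisionTreeClassifier(random_state=i).fit(D_X[idx], D_y[idx])
    classifiers.append(C_i)

# C*(x) = Vote(C1(x), C2(x), ..., Ck(x)) for every test sample
votes = np.stack([C.predict(Z_X) for C in classifiers])            # shape (k, n_test)
C_star = np.array([np.bincount(col).argmax() for col in votes.T])  # majority vote
print("ensemble accuracy:", (C_star == Z_y).mean())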

How to combine the learners' results

Voting: used for classification problems.
Averaging: used for regression (numeric) predictions; divided into simple averaging and weighted averaging.
Learning (stacking): addresses the relatively large errors of voting and averaging by placing another layer of learner on top of the weak learners. A sketch of the three combination strategies follows.
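A brief sketch of the three combination strategies with sklearn (the particular base learners, parameter values, and synthetic data are my own assumptions for illustration):

from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

Xc, yc = make_classification(n_samples=300, random_state=0)
Xr, yr = make_regression(n_samples=300, random_state=0)

# Voting: classification, take the majority class of the base learners
vote = VotingClassifier([('tree', DecisionTreeClassifier()),
                         ('knn', KNeighborsClassifier()),
                         ('lr', LogisticRegression(max_iter=1000))]).fit(Xc, yc)

# Averaging: regression, average (or weight-average) the base predictions
pred_a = RandomForestRegressor(random_state=0).fit(Xr, yr).predict(Xr)
pred_b = GradientBoostingRegressor(random_state=0).fit(Xr, yr).predict(Xr)
avg = 0.5 * pred_a + 0.5 * pred_b       # weighted average with equal weights

# Learning (stacking): a second-layer learner combines the weak learners' outputs
stack = StackingClassifier([('tree', DecisionTreeClassifier()),
                            ('knn', KNeighborsClassifier())],
                           final_estimator=LogisticRegression(max_iter=1000)).fit(Xc, yc)
print(vote.score(Xc, yc), stack.score(Xc, yc))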

Approaches to building the ensemble:

  • One approach trains each base classifier on a different subset of the training set (bagging).
  • Another approach trains the base classifiers on different subsets of the attributes (features) of the training set (random forests).

Bagging:

Training examples are drawn from the training set with replacement to build a training set for each base learner; these training sets have the same size but differ from one another, so the base learners are trained differently. This is the simplest and most intuitive ensemble procedure based on manipulating the training set.
Algorithm:

  1. Randomly and independently draw n' samples (n' <= n) from the original data set D of size n to form one bootstrap data set;
  2. Repeat this process to produce K independent bootstrap data sets;
  3. Train K optimal models on the K bootstrap data sets;
  4. Classification: the final result is decided by a vote over the K models' predictions; regression: the K predicted values are averaged to obtain the final result.
  • Since every sample is selected with the same probability, bagging does not focus on any particular instance of the training data, so it is not prone to overfitting noisy data.
  • One drawback of bagging is worth mentioning: because prediction no longer comes from a single decision tree, it becomes unclear which variables play an important role; bagging improves prediction accuracy but sacrifices interpretability. A minimal bagging sketch follows.
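A minimal bagging sketch using sklearn's BaggingClassifier (the decision-tree base learner, the 0.8 sampling fraction, and the synthetic data are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# K bootstrap data sets of size n' = 0.8 * n, one decision tree per data set,
# final prediction by majority vote over the K trees
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=20,
                        max_samples=0.8, bootstrap=True, random_state=0)
bag.fit(X_tr, y_tr)
print("bagging accuracy:", bag.score(X_te, y_te))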

Random Forests

A random forest is a classifier made up of multiple decision trees; its output class is the mode of the classes output by the individual trees. It is an extended variant of bagging,
i.e. RF = Bagging + decision trees.

  • The classification results of many weak classifiers are combined by voting to form a strong classifier; this is the bagging idea behind random forests.
  • A random forest contains many classification trees. To classify an input sample, we feed it into every tree in the forest for classification.

The main factors affecting RF classification performance

  • Classification strength of an individual tree in the forest (strength): the greater the classification strength of each single tree, the better the random forest's classification performance.
  • Correlation between trees in the forest: the greater the correlation between the trees, the worse the random forest's classification performance.

Construction of random forests

  1. Use the bootstrap method (random sampling with replacement) to select m samples from the original sample set of m samples;
  2. Randomly select k attributes from all n attributes (if k = n, the resulting decision tree is the same as a conventional decision tree; if k = 1, a single attribute is chosen for splitting); in general k = log2(n);
  3. Choose the best split attribute (using ID3, C4.5, or CART) to create each node of the decision tree;
  4. Grow each tree to its maximum extent without pruning;
  5. Repeat the steps above S times to build S decision trees, which together form the random forest;
  6. For classification problems, the output class is decided by majority vote; for regression problems, the outputs of all the decision trees are averaged.
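These steps correspond roughly to the hyperparameters of sklearn's RandomForestClassifier; a hedged sketch, where the specific parameter values and data are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=16, random_state=0)

# n_estimators = S trees; max_features='log2' considers about log2(n) attributes at
# each split; bootstrap=True draws m samples with replacement; max_depth=None lets
# each tree grow fully without pruning
rf = RandomForestClassifier(n_estimators=100, max_features='log2',
                            bootstrap=True, max_depth=None, random_state=0)
rf.fit(X, y)
print("training accuracy:", rf.score(X, y))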

Advantages of random forests

  • Among current algorithms, it achieves excellent accuracy.
  • Training is highly parallelizable, so it runs efficiently on large data sets with good speed.
  • It can handle input samples with high-dimensional features without requiring dimensionality reduction.
  • It can assess the importance of each feature in a classification problem (feature selection; sklearn provides a good tool for this, see the sketch after this list).
  • It is insensitive to the loss of some of the features.
  • Because of random sampling with replacement, the trained model has small variance and strong generalization ability.
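For the feature-importance point above, a small sketch using the feature_importances_ attribute (synthetic data, illustrative only):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ holds each feature's impurity-based importance
for i in np.argsort(rf.feature_importances_)[::-1]:
    print(f"feature {i}: importance {rf.feature_importances_[i]:.3f}")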

Disadvantages of random forests

  • Features with more possible values have a greater influence on the RF's decisions, which may distort the model's results.
  • Bagging improves prediction accuracy but sacrifices interpretability.
  • On features with relatively high noise, RF models are prone to overfitting.

Random forest usage

# Random forest
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train_path = 'car.csv'

# Read the data file; if it lives in a path containing Chinese characters,
# read_csv needs the parameter engine='python'
data_frame_train = pd.read_csv(train_path, encoding='gbk')
# Split the training data into features X and labels y
X_train, y_train = data_frame_train.values[:, :-1], data_frame_train.values[:, -1]

# Effect without tuning any parameters (oob_score=True: evaluate the model on the
# out-of-bag samples, which reflects its generalization ability)
# Instantiate the model
rfclf = RandomForestClassifier(oob_score=True, random_state=10)
# Train the model
rfclf.fit(X_train, y_train)
# Predict on the training data (no separate test set is split off here)
y_pre = rfclf.predict(X_train)                 # predicted labels
y_prb_1 = rfclf.predict_proba(X_train)[:, 1]   # predicted probability of class 1
# Print the out-of-bag score
print(rfclf.oob_score_)


Boosting

Like bagging, boosting is a method that combines several base classifiers into one classifier. Boosting
is a sequential process: each successive model attempts to correct the errors of the previous model, so every model depends on the models before it.

How boosting works:
Step 1: Create a subset from the original data set.
Step 2: Initially, all data points have the same weight.
Step 3: Build a model on this subset.
Step 4: Use this model to make predictions on the entire data set.
Step 5: Compute the errors from the actual and predicted values.
Step 6: Give the wrongly predicted points a higher weight (the misclassified points marked with a plus sign receive more weight).
Step 7: Create another model and make predictions on the data set (this model tries to correct the errors of the previous model).
Step 8: In the same way, create multiple models, each correcting the errors of the previous one.
Step 9: The final model (the strong learner) is a weighted average of all the models (the weak learners).
In this way boosting combines many weak learners into a strong learner: each individual model performs poorly on the data set as a whole, but performs well on some part of it, so each model genuinely boosts the overall performance of the ensemble.

Boosting is an iterative process that adaptively changes the distribution of the training samples so that the base classifiers focus on the samples that are hard to classify. A minimal sketch of this reweighting loop follows.
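A minimal sketch of the reweighting loop, essentially a hand-rolled AdaBoost for labels in {-1, +1} (the decision-stump base learner and the synthetic data are assumptions for illustration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
y = 2 * y - 1                                  # relabel classes to {-1, +1}

n, T = len(X), 20
w = np.full(n, 1.0 / n)                        # step 2: equal initial weights
stumps, alphas = [], []
for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1, random_state=t)
    stump.fit(X, y, sample_weight=w)           # step 3: fit a model to the weighted data
    pred = stump.predict(X)                    # step 4: predict on the whole data set
    eps = w[pred != y].sum()                   # step 5: weighted error
    alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))
    w *= np.exp(-alpha * y * pred)             # step 6: raise weights of wrong points
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# step 9: strong learner = sign of the weighted sum of the weak learners
F = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", (F == y).mean())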

AdaBoost

  • AdaBoost (Adaptive Boosting): the principle of the algorithm is to combine weak learners in a reasonable way so that they form a strong learner.
  • AdaBoost uses an iterative idea inherited from boosting: each iteration trains only one weak learner, and the already trained weak learners take part in the next iteration.
  • In other words, after N iterations there are N weak learners in total; the first N-1 were trained earlier and their parameters no longer change, and the current iteration trains the N-th learner.
  • The relationship between the weak learners is that the N-th weak learner is more likely to classify correctly the data that the previous N-1 weak learners misclassified; the final classification output depends on the combined effect of all N classifiers.
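The same idea through sklearn's AdaBoostClassifier (the parameter values and data are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# N weak learners (depth-1 trees by default), trained one per iteration;
# earlier learners keep their parameters unchanged
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
ada.fit(X_tr, y_tr)
print("AdaBoost accuracy:", ada.score(X_te, y_te))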

Example:
1. Initialize the weights of the 10 samples: w1 = 1/10.
2. Using the weights w1, sample from data set D to produce the training set D1.
3. Train a base classifier h1 on D1. The best split point of this classifier is k = 0.75: if x ≤ k then y = -1, and if x > k then y = 1. This classifier is then used to classify the original data set D.
4. Among the classification results for the 10 samples, 3 samples are misclassified and 7 are correctly classified. The correctly classified samples are {0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1}, and the misclassified samples are {0.1, 0.2, 0.3}.
5. From the classification result on the original data set D, compute the error measure.
6. Compute the classifier weight.
7. Update the sample weights for the second boosting round, producing the weights w2.
8. Sample from D according to the distribution w2 to generate the training set D2.
9. Train the classifier h2 on D2.
10. Repeat this process; the final result combines all of the generated classifiers. The weight computation for the first round is sketched below.
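The error measure and weight update in steps 5-7 can be reproduced with the standard AdaBoost formulas eps = sum of w_i over the misclassified samples and alpha = 0.5 * ln((1 - eps) / eps); a minimal sketch for this example, assuming those formulas (the original text does not spell them out):

import numpy as np

w1 = np.full(10, 0.1)                      # step 1: w1_i = 1/10 for all 10 samples
misclassified = np.array([True, True, True] + [False] * 7)   # samples {0.1, 0.2, 0.3}

eps1 = w1[misclassified].sum()             # step 5: weighted error = 0.3
alpha1 = 0.5 * np.log((1 - eps1) / eps1)   # step 6: classifier weight, about 0.4236

# step 7: raise the weights of the misclassified samples, lower the rest, renormalize
w2 = w1 * np.exp(np.where(misclassified, alpha1, -alpha1))
w2 /= w2.sum()
print(alpha1)   # ~0.4236
print(w2)       # ~0.1667 for the misclassified samples, ~0.0714 for the rest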
