Random Forest (RF) and the Bagging algorithm

Random Forest (RF)

Random forest is an algorithm that integrates multiple trees through the idea of ensemble learning; its basic unit is the decision tree, and in essence it belongs to a major branch of machine learning: ensemble learning. The name "random forest" contains two key words, "random" and "forest". "Forest" is easy to understand: one classifier is a tree, so hundreds of them can be called a forest, and this reflects the main idea of random forest, the ensemble idea; the "random" part comes from the sampling procedure described below.

Each tree in the forest is a decision tree classifier (assume for now that we are dealing with a classification problem), so for one input sample, N trees produce N classification results. The random forest integrates all of the trees' votes and designates the class with the most votes as the final output; this is the simplest form of the Bagging idea.
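
To make the voting step concrete, here is a minimal sketch of the majority vote across N trees; the per-tree predictions are made-up integer class labels, purely for illustration:

```python
import numpy as np

# Hypothetical predictions from N = 5 trees for 4 input samples
# (rows: trees, columns: samples); class labels are 0 or 1.
tree_predictions = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
])

def majority_vote(preds):
    """Return, for each column (sample), the class receiving the most votes."""
    n_classes = preds.max() + 1
    counts = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    return counts.argmax(axis=0)

print(majority_vote(tree_predictions))  # -> [0 1 1 0]
```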

Features of random forest

Among current algorithms, it achieves excellent accuracy

It can run efficiently on large datasets

It can handle input samples with high-dimensional features and does not require dimensionality reduction

It can assess the importance of each feature for the classification problem

During the building process, an unbiased estimate of the generalization error can be obtained internally

It can also obtain good results on problems with missing values

Ensemble learning

Ensemble learning solves a single prediction problem by building and combining several models. It works by generating multiple classifiers/models, each of which learns and makes predictions independently. These predictions are then combined into a single final prediction, which is therefore better than the prediction of any single classifier.

Random forest is a sub-class of ensemble learning; it relies on voting among decision trees to determine the final classification result.
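
As a concrete illustration of this voting-based ensemble, the sketch below fits scikit-learn's RandomForestClassifier on a toy dataset; the dataset choice and parameter values are placeholders, not part of the original text:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 decision trees; the final prediction is the majority vote of the trees.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```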

Generating a random forest

A random forest contains many classification trees. To classify an input sample, we feed the sample to every tree in the forest for classification. A vivid metaphor: the forest holds a meeting to decide whether an animal is a squirrel or a rat; each tree must independently express its opinion on the question, that is, each tree must vote. Whether the animal is judged a rat or a squirrel is determined by the voting: the category with the most votes is the forest's classification result. Every tree in the forest is independent; the 99.9% of trees whose predictions are irrelevant to a given case cover all possibilities, and these predictions cancel each other out. The predictions of the few good trees rise above all the "noise" and produce a good prediction. Voting over the classification results of a number of weak classifiers thus forms a strong classifier; this is the Bagging idea behind random forests.

With trees we can classify, but how is each tree in the forest generated?

Each tree is generated according to the following rules (a code sketch follows the list):

1) If the size of the training set is N, then for each tree, N training samples are drawn from the training set at random and with replacement (this sampling method is called a bootstrap sample) to serve as that tree's training set; each tree's training set is different and contains repeated training samples.

2) If the feature dimension of each sample is M, specify a constant m << M; a subset of m features is randomly selected from the M features, and each time the tree splits, the optimal feature among these m is chosen.

3) Each tree is grown to the greatest extent possible, with no pruning.
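
The following hand-rolled sketch illustrates the three rules above, assuming scikit-learn's DecisionTreeClassifier as the base learner; the function name build_forest and the default values are made up for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=50, m=None, random_state=0):
    """Illustrative sketch of the three tree-generation rules."""
    rng = np.random.default_rng(random_state)
    n_samples, n_features = X.shape
    m = m or max(1, int(np.sqrt(n_features)))  # a common choice for m << M
    forest = []
    for _ in range(n_trees):
        # Rule 1: draw a bootstrap sample of size N, at random with replacement.
        idx = rng.integers(0, n_samples, size=n_samples)
        # Rule 2: at each split, only m randomly chosen features are considered
        #         (handled by the tree itself via max_features=m).
        # Rule 3: grow the tree as far as possible, with no pruning (max_depth=None).
        tree = DecisionTreeClassifier(max_features=m, max_depth=None,
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest
```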

The classification performance (error rate) of a random forest depends on two factors:

The correlation between any two trees in the forest: the greater the correlation, the higher the error rate;

The classification ability of each tree in the forest: the stronger each tree's classification ability, the lower the error rate of the whole forest.

Reducing the number m of selected features lowers both the correlation between trees and the classification ability of each tree; increasing m raises both. So the key question is how to choose the optimal m (or a range for it), and this is the only tunable parameter of a random forest.
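
A small sketch of how one might search for a good m, using scikit-learn's max_features parameter and the out-of-bag score discussed in the next section; the candidate values and dataset are arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)  # 30 features in total (M = 30)

# Sweep a few candidate values of m and compare the out-of-bag error.
for m in (2, 4, 8, 16, 30):
    rf = RandomForestClassifier(n_estimators=200, max_features=m,
                                bootstrap=True, oob_score=True, random_state=0)
    rf.fit(X, y)
    print(f"m = {m:2d}  OOB error = {1 - rf.oob_score_:.4f}")
```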

Out-of-bag error rate (OOB error)

As mentioned above, the key question in building a random forest is how to choose the optimal m; the main basis for solving this problem is the out-of-bag error rate, or OOB error (out-of-bag error).

An important advantage of random forests is that there is no need to perform cross-validation or use a separate test set to obtain an unbiased estimate of the error. The error can be evaluated internally; that is, an unbiased estimate can be established during the building process itself.

When each tree is built, a different bootstrap sample (drawn at random with replacement) of the training set is used. Therefore, for each tree (say the k-th tree), about 1/3 of the training instances do not participate in generating that tree; these are called the OOB samples of the k-th tree.

This sampling characteristic allows us to make an OOB estimate, which is computed as follows (a code sketch follows the steps):

(Note: the computation is done per sample)

1) For each sample, compute its classification using the trees for which it is an OOB sample (about 1/3 of the trees);

2) Then take a simple majority vote as the classification result for that sample;

3) Finally, the ratio of misclassified samples to the total number of samples is used as the OOB misclassification rate of the random forest.
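
Below is a hand-rolled sketch of this OOB calculation, following the three steps above; the base learner, dataset, and variable names are assumptions made for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_samples = X.shape[0]
n_trees, n_classes = 100, len(np.unique(y))
rng = np.random.default_rng(0)

# votes[i, c]: number of trees for which sample i was out-of-bag that voted class c.
votes = np.zeros((n_samples, n_classes), dtype=int)

for _ in range(n_trees):
    idx = rng.integers(0, n_samples, size=n_samples)    # bootstrap sample
    oob_mask = ~np.isin(np.arange(n_samples), idx)      # roughly 1/3 of the samples
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1 << 31)))
    tree.fit(X[idx], y[idx])
    if oob_mask.any():
        preds = tree.predict(X[oob_mask])                # step 1: classify OOB samples
        votes[np.flatnonzero(oob_mask), preds] += 1

# Step 2: simple majority vote per sample; step 3: OOB misclassification rate.
has_votes = votes.sum(axis=1) > 0
oob_pred = votes[has_votes].argmax(axis=1)
print("OOB error estimate:", np.mean(oob_pred != y[has_votes]))
```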

 

Bagging algorithm

Bagging (English: Bootstrap aggregating), also known as the bagging algorithm, is an ensemble learning algorithm in machine learning. Bagging can be combined with other classification or regression algorithms to improve their accuracy and stability while reducing the variance of the results, thereby avoiding overfitting.

Bagging reduces generalization error by combining several models. The main idea is to train several different models separately and then have all of them vote on the output for test samples. This is an example of a standard machine learning strategy called model averaging. Techniques that employ this strategy are called ensemble methods.

The basic idea

1. Given a weak learning algorithm and a training set;

2. A single weak learning algorithm has low accuracy;

3. Apply the learning algorithm multiple times to obtain a sequence of prediction functions, and let them vote;

4. The accuracy of the final result is thereby improved.

Algorithm steps

Given a training set D of size n, the Bagging algorithm uniformly and with replacement (i.e., using bootstrap sampling) selects m subsets Di of size n' to serve as new training sets. Applying a classification or regression algorithm to these m training sets yields m models; the results are then combined by averaging, majority voting, or similar, which gives the result of Bagging.
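
A minimal sketch of these steps using scikit-learn's BaggingClassifier with a decision tree as the base learner; here n_estimators plays the role of m and max_samples the role of n', and the chosen values are arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# m = 50 base models, each trained on a bootstrap sample of size n' drawn
# uniformly with replacement; classification results are combined by voting.
bag = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators=50,     # m
                        max_samples=1.0,     # n' as a fraction of n
                        bootstrap=True,
                        random_state=0)
print("bagging CV accuracy:", cross_val_score(bag, X, y, cv=5).mean())
```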

Bagging algorithm properties

1. Bagging improves generalization error by reducing the variance of the base classifiers.

2. Its performance depends on the stability of the base classifiers: if a base classifier is unstable, Bagging helps reduce the error caused by random fluctuations in the training data; if it is stable, the error of the ensemble is mainly caused by the bias of the base classifiers.

3. Since every sample has the same probability of being selected, Bagging does not focus on any particular instance in the training data set.
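
As a rough illustration of the variance-reduction effect on an unstable base classifier, the sketch below compares a single decision tree with a bagged ensemble of trees; the dataset and fold count are arbitrary choices:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

models = {
    "single tree":  DecisionTreeClassifier(random_state=0),
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                                      random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: mean accuracy {scores.mean():.3f}, std {scores.std():.3f}")
```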

 
