Machine Learning Series - Bagging and Random Forest

Bagging

There are two main categories of ensemble learning algorithms: Boosting, whose representative algorithm is AdaBoost, and Bagging, of which the random forest introduced in this article is a variant.

Bagging, short for bootstrap aggregating, resamples the original dataset with replacement to build T new datasets, each containing m samples, and trains one classifier on each of them. Because sampling is done with replacement, the same sample may appear multiple times within a new dataset. To classify a new sample, the predictions of all trained classifiers are combined by majority voting (or by averaging the outputs for regression), and the class receiving the most votes is taken as the final label.
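
The following is a minimal sketch of this procedure, assuming scikit-learn decision trees as the base classifiers; the names T and m follow the description above, and the helper functions are illustrative, not a standard API.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, random_state=0):
    """Train T trees, each on a bootstrap sample of m rows drawn with replacement."""
    rng = np.random.default_rng(random_state)
    m = X.shape[0]
    trees = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)                 # sampling with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Majority vote over the individual trees' predictions."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)   # shape (T, n_samples)
    # the most frequent label in each column is the final prediction
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
trees = bagging_fit(X, y)
print(bagging_predict(trees, X[:5]), y[:5])
```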

Random Forest

Random forest is a variant of Bagging: on top of a Bagging ensemble that uses decision trees as base learners, it further introduces random attribute selection into the training of each decision tree.

A traditional decision tree selects the optimal splitting attribute from the full attribute set of the current node (assume there are d attributes). In a random forest, for each node of a decision tree, a subset of k attributes is first selected at random from that node's attribute set, and the optimal splitting attribute is then chosen from this subset. The parameter k controls the degree of randomness: if k = d, the base decision tree is built in the same way as a traditional decision tree; if k = 1, a single attribute is chosen at random for the split; in general, k = log2(d) is recommended.
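
A brief sketch of this idea, assuming scikit-learn: the max_features parameter plays the role of k, and passing "log2" corresponds to the recommended k = log2(d).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# max_features="log2": at every node, log2(d) attributes are drawn at random
# and the best split is chosen only among them.
rf = RandomForestClassifier(n_estimators=100, max_features="log2", random_state=0)
rf.fit(X, y)
print(rf.score(X, y))
```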

Advantages and disadvantages of random forests

Advantages of random forests

  1. On many current datasets, it has clear advantages over other algorithms and performs well.
  2. It can handle very high-dimensional data without requiring feature selection.
  3. After training, it can report which features are more important (see the sketch after this list).
  4. An unbiased estimate of the generalization error is obtained while the forest is being built (the out-of-bag estimate), and the model generalizes well.
  5. Training is fast and easy to parallelize.
  6. Interactions between features can be detected during training.
  7. The implementation is relatively simple.
  8. For imbalanced datasets, it can help balance the error.
  9. Accuracy can still be maintained even if a significant portion of the feature values are missing.
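
A brief sketch, assuming scikit-learn, of two of the points above: the out-of-bag (OOB) estimate of the generalization error (item 4) and the per-feature importance scores (item 3).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=15, n_informative=5, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

print("OOB accuracy:", rf.oob_score_)                 # estimated from samples left out of each bootstrap
print("feature importances:", rf.feature_importances_)
```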

Disadvantages of random forests

  1. Random forests have been shown to overfit on some noisy classification or regression problems.
  2. For data whose attributes have different numbers of possible values, attributes with more levels can have a larger influence on the split selection, so the attribute importance scores produced by random forests on such data are not reliable.
