Random Forest Information & Entropy & Information Gain

What is random forest?

What are the characteristics of random forest?

What can random forests do?

What is the Random Forest Principle?

 

What is random forest?

As a new and highly flexible machine learning algorithm, Random Forest (RF) has a wide range of application prospects.

Random forest is an algorithm that integrates multiple trees through the idea of ensemble learning. Its basic unit is the decision tree, and it belongs to a major branch of machine learning: ensemble learning.

 

There are two keywords in the name "random forest": "random" and "forest". "Forest" is easy to understand: if a single tree is called a tree, then hundreds or thousands of trees can be called a forest, and the analogy is quite apt. This is in fact the main idea of random forest, the embodiment of ensemble learning.

 

Intuitively, each decision tree is a classifier (assuming a classification problem), so for an input sample, N trees produce N classification results. Random forest collects all the votes and assigns the category with the most votes as the final output; this is the simplest Bagging idea, sketched in the example below.
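Below is a minimal sketch of this bootstrap-plus-voting scheme, assuming scikit-learn and NumPy are available; the dataset, the number of trees, and the parameter values are made up purely for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

n_trees = 25
rng = np.random.default_rng(0)
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw n rows with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Each tree votes; the class with the most votes is the final output.
votes = np.stack([t.predict(X) for t in trees])   # shape: (n_trees, n_samples)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("ensemble training accuracy:", (majority == y).mean())

In practice, scikit-learn's RandomForestClassifier packages this same idea behind a single fit/predict interface (internally it averages class probabilities rather than counting hard votes).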

Random Forest Features

  • It is unexcelled in accuracy among current algorithms;
  • It runs efficiently on large databases;
  • It can handle input samples with high-dimensional features (thousands of input variables) without dimensionality reduction or variable deletion;
  • It gives estimates of which variables are important in the classification;
  • It generates an internal unbiased estimate of the generalization error as the forest is built;
  • It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.

 

Information & Entropy & Information Gain


Entropy represents the degree of information confusion. The larger the entropy, the more chaotic the information, and the smaller the entropy, the more certain the information is.

H(x) = -( p(x)·log p(x) + (1 - p(x))·log(1 - p(x)) )

(Figure: the binary entropy curve as a function of p.)

 

One can verify that the entropy is smallest when p = 0 or p = 1 (the outcome is certain) and largest when p = 0.5 (the information is most chaotic). Building a decision tree is precisely the process of gradually reducing entropy: the entropy at the root node is greater than the weighted average entropy of its child nodes.
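A quick numerical check of the formula above, as a Python sketch (assuming NumPy; the base-2 logarithm is used so entropy is measured in bits):

import numpy as np

def binary_entropy(p):
    # H(p) = -(p*log2(p) + (1-p)*log2(1-p)), with 0*log(0) taken as 0.
    p = np.asarray(p, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return np.nan_to_num(h)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, binary_entropy(p))
# Entropy is 0 at p = 0 and p = 1 (the outcome is certain)
# and peaks at 1 bit when p = 0.5 (the information is most chaotic).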

 

Information gain is the information entropy of the parent (root) node minus the conditional entropy of its child nodes after splitting on a feature:

gain(Y,f1)= H(Y) - H(Y|f1)

Information gain ratio:

gain_ratio(Y, f1) = gain(Y, f1) / H(f1); using this ratio as the splitting criterion gives the C4.5 algorithm.

H(f1) is the information entropy of the feature itself: the more (and the more evenly distributed) its values, the larger this entropy, so dividing by it penalizes features with many distinct values.
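The sketch below computes the information gain and gain ratio of one categorical feature (Python/NumPy; the toy label and feature arrays are invented for illustration):

import numpy as np

def entropy(values):
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(y, f):
    # gain(Y, f) = H(Y) - H(Y | f)
    h_conditional = 0.0
    for value in np.unique(f):
        mask = (f == value)
        h_conditional += mask.mean() * entropy(y[mask])
    return entropy(y) - h_conditional

def gain_ratio(y, f):
    # C4.5 criterion: gain(Y, f) / H(f)
    return information_gain(y, f) / entropy(f)

y  = np.array([0, 0, 1, 1, 1, 0, 1, 1])                   # class labels
f1 = np.array(["a", "a", "b", "b", "b", "a", "b", "a"])   # a candidate feature
print(information_gain(y, f1), gain_ratio(y, f1))

A split is then made on whichever candidate feature has the largest gain (ID3) or gain ratio (C4.5).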

 

Gini coefficient:

Both information entropy and the Gini coefficient are intended to measure the degree of impurity (confusion) of a variable.

 

Variance measures the degree of dispersion of a variable: the greater the variance, the higher the degree of dispersion. Variance is used to measure stability; for regression trees it plays the role that entropy or the Gini coefficient play for classification.

For the Gini coefficient, the formula to keep in mind is Gini = 1 - Σ p_i², where p_i is the proportion of class i at the node.
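A short sketch of the Gini computation next to the entropy, for comparison (Python/NumPy; the label array is made up):

import numpy as np

def gini(labels):
    # Gini = 1 - sum(p_i^2) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

y = np.array([0, 0, 1, 1, 1, 1])
print("gini:", gini(y), "entropy:", entropy(y))
# A pure node (all one class) gives gini = 0 and entropy = 0;
# a 50/50 node gives gini = 0.5 and entropy = 1 bit.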

 

Decision tree evaluation

The evaluation sums the entropy of all leaf nodes, each weighted by the fraction of samples that fall into it. The smaller this value, the purer the leaves and the more accurate the classification of the samples; since smaller is better, it can also be used as a loss function.
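A hedged sketch of this evaluation, where the tree is reduced to its leaves and each leaf's entropy is weighted by the fraction of samples it holds (the leaf contents below are invented for illustration):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def tree_loss(leaves):
    # Sum over leaves of (fraction of samples in the leaf) * (entropy of the leaf).
    n_total = sum(len(leaf) for leaf in leaves)
    return sum(len(leaf) / n_total * entropy(np.asarray(leaf)) for leaf in leaves)

# Three hypothetical leaves: two pure, one mixed.
leaves = [[0, 0, 0, 0], [1, 1, 1], [0, 1, 1, 0]]
print(tree_loss(leaves))   # smaller is better; 0 means every leaf is pure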

 

 

 

 

 

 

 

 

How to build a decision tree?

Choosing the first feature to split on is the most important step in building the tree. So how do we choose?

How to choose features as the first classification?

Choose the feature with the largest information gain: the greater a feature's information gain, the greater its impact on the classification.

 

Random forest is a commonly used method to measure feature importance.

Two common approaches: accumulate, over the nodes where a feature is used to split, how much that split reduces the Gini coefficient (or entropy), using the number of samples reaching those nodes as weights; or randomly permute the values of one feature, re-evaluate the model, and use the resulting change in accuracy to account for the importance of that feature. A sketch of both follows below.
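A hedged sketch of both importance measures with scikit-learn (the dataset and parameter values below are arbitrary and only illustrate the API):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# 1) Impurity-based importance: accumulated Gini decrease over the splits that use each feature.
print("impurity-based:", rf.feature_importances_)

# 2) Permutation importance: drop in test accuracy when one feature's values are shuffled.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print("permutation:", perm.importances_mean)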

 

 

 

 

Some common questions

How to judge whether to use information gain or Gini coefficient?

Trees built with information gain and with the Gini coefficient are similar; in practice the Gini coefficient is often found to work slightly better, but that is not guaranteed.

Why must the loss function for classification be cross entropy, while regression uses squared error? Is it because a class label cannot be used directly as the numeric output of a classification problem?

Yes.

Is entropy generally used for classification problems?

Yes. Cross entropy is generally used as the classification loss.

When evaluating a tree, how do we judge how important a given factor is to the outcome?

Use information entropy or the Gini coefficient to measure its influence; this is exactly how random forest determines feature importance.

 

When features are randomly selected, is the sampling done with replacement?

No. Features are sampled without replacement (a feature is not selected twice for the same subset), but the training samples are drawn with replacement (the bootstrap), as sketched below.
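A tiny NumPy sketch of the two sampling schemes (the sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 10, 6

# Rows (samples): bootstrap, i.e. drawn WITH replacement -> duplicates are expected.
row_idx = rng.choice(n_samples, size=n_samples, replace=True)

# Columns (features): a subset drawn WITHOUT replacement -> no feature repeats.
col_idx = rng.choice(n_features, size=3, replace=False)

print("rows   :", sorted(row_idx))   # typically contains repeats
print("columns:", sorted(col_idx))   # never contains repeats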

 

XGBoost and GBDT are a little better than random forest, right?

Not necessarily. All of these algorithms have been shown to be effective in practice; a more complicated algorithm is not automatically a better one.

Generally speaking, a convolutional neural network will classify better than a random forest, but if the amount of data is small or the features are already quite informative, a random forest may already give good results, comparable to XGBoost and GBDT.

Does increasing the number of trees in a random forest lead to overfitting?

No. The more trees there are, the less likely the forest is to overfit. Once the accuracy stops improving, however, adding more trees only wastes memory and computation.

Are the samples split by train_test_split random?

Yes, the split is random.
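For example, with scikit-learn's train_test_split (toy data; fixing random_state makes the random split reproducible):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Different calls shuffle differently unless random_state is fixed.
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=42)
X_tr3, X_te3, y_tr3, y_te3 = train_test_split(X, y, test_size=0.3, random_state=42)
print((X_te2 == X_te3).all())   # True: the same random_state gives the same split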

Do the features of a random forest also need to be subsampled?

Yes, a subset of the features can be taken as well. If there are 20 candidate features, the usual practice is to randomly pick a subset, for example 15 of them, and then choose the best feature from those 15; in other words, splits are made at a given feature sampling rate (see the sketch below).

The random sampling rate of features can effectively prevent overfitting.
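In scikit-learn this feature sampling rate corresponds to the max_features parameter; the sketch below is only an illustration, with an arbitrary dataset and an arbitrary rate:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=0)

# Consider only a random subset of the 20 features at each split,
# e.g. 75% of them (15), or sqrt(20) ≈ 4 with max_features="sqrt".
rf = RandomForestClassifier(n_estimators=100, max_features=0.75, random_state=0)
rf.fit(X, y)
print(rf.score(X, y))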

 

How does random forest handle the coupling between two features?

Random forest does not explicitly consider coupling relationships; it simply builds the trees directly.

Wouldn't it be better to use a neural network directly?

Consider the cost/performance trade-off, especially for deep neural networks.

How large should the feature sampling ratio be each time? Can a specific value be given?

There is no single suitable value. If the model already fits well and you want to improve its generalization ability, the ratio can be made smaller, even 0.1 or 0.2.

SR: I saw a statement that for each node of each tree, m features are randomly selected; how should that be understood?

Yes, that is exactly the approach of random forest: at each node it randomly draws m features and then selects the best one among them.

Features and data determine the upper bound of a machine learning problem; models and algorithms can only approach that bound as closely as possible. So XGBoost and GBDT are sometimes not necessarily better than logistic regression.

Yes. In theory XGBoost and GBDT can approach that bound, but random forest may not.

How do the different decision trees in a random forest vote when doing classification?

For classification, random forest takes a majority vote (the minority follows the majority). For regression, the outputs of the trees are averaged.
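A small sketch of the two aggregation rules, peeking at the individual trees through scikit-learn's estimators_ attribute (toy data; note that scikit-learn's classifier actually averages the trees' class probabilities, which usually coincides with a hard majority vote):

import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: each tree votes, and the majority class wins.
Xc, yc = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xc, yc)
votes = np.stack([t.predict(Xc[:1]) for t in clf.estimators_]).astype(int).ravel()
print("votes per class:", np.bincount(votes), "-> prediction:", clf.predict(Xc[:1]))

# Regression: the trees' outputs are averaged.
Xr, yr = make_regression(n_samples=200, random_state=0)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xr, yr)
per_tree = np.array([t.predict(Xr[:1])[0] for t in reg.estimators_])
print("mean of tree outputs:", per_tree.mean(), "forest prediction:", reg.predict(Xr[:1])[0])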

 
