Feature selection methods for nodes in a decision tree

1. Information gain

  Information gain is the splitting criterion used in the ID3 decision tree; it is defined in terms of the change in entropy that a split produces.

  Entropy: a measure of the uncertainty of a random variable. The greater the entropy, the greater the uncertainty of the variable.

  The formula for entropy is as follows.

         If the probability distribution of X is P(X = x_i) = p_i, i = 1, 2, ..., n (where n is the number of possible values of X), then the entropy of the random variable X is H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i, with the convention 0 \log 0 = 0.
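
        As a concrete illustration of the formula above, here is a minimal Python sketch of the empirical entropy computation. The function name entropy, the use of NumPy, and the toy labels are my own choices for illustration and are not part of the original post.

```python
import numpy as np

def entropy(labels):
    """Empirical entropy H(D) of a sequence of class labels, in bits."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()        # estimated probabilities p_i
    return -np.sum(p * np.log2(p))   # values with p_i = 0 never appear in counts

# A perfectly mixed binary set has entropy 1 bit.
print(entropy(["yes", "no", "yes", "no"]))   # -> 1.0
```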

        Conditional entropy: H(Y|X) represents the uncertainty of the random variable Y given the random variable X, defined as H(Y|X) = \sum_{i=1}^{n} p_i H(Y|X = x_i), where p_i = P(X = x_i).

        In a decision tree, Y corresponds to the class labels of the data set D and X to a feature A, so the conditional entropy is the entropy of the data set after it has been partitioned by the values of feature A.

        Information gain: the difference between the entropy H(D) of the data set D and the conditional entropy H(D|A) of D given feature A, g(D|A) = H(D) - H(D|A).

        Therefore, the feature selection rule when splitting a node by information gain is: for the training data set D, compute the information gain of every candidate feature, compare them, and select the feature with the largest information gain, as sketched below.
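
        The following sketch illustrates this selection rule. The helper names (entropy, info_gain), the toy features, and the labels are hypothetical and chosen only for illustration; the loop is a plain implementation of g(D|A) = H(D) - H(D|A), not ID3's full training procedure.

```python
import numpy as np

def entropy(labels):
    """Empirical entropy H(D) of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature_values, labels):
    """g(D|A) = H(D) - H(D|A) for one feature column A."""
    feature_values = np.asarray(feature_values)
    labels = np.asarray(labels)
    h_d_a = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        h_d_a += len(subset) / len(labels) * entropy(subset)   # weighted entropy of each branch
    return entropy(labels) - h_d_a

# ID3-style selection: choose the feature with the largest information gain.
features = {
    "outlook": ["sunny", "sunny", "rain", "rain", "overcast", "overcast"],
    "windy":   ["yes",   "no",    "yes",  "no",   "yes",      "no"],
}
labels = ["no", "no", "yes", "yes", "yes", "yes"]
best = max(features, key=lambda name: info_gain(features[name], labels))
print(best)   # -> outlook
for name, column in features.items():
    print(name, round(info_gain(column, labels), 3))
```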

2. Information gain ratio

        Using information gain as the criterion for splitting the data set has a drawback: it is biased toward features with many possible values. The information gain ratio can be used to correct this bias, and the C4.5 decision tree performs feature selection and node splitting based on the information gain ratio.

        Definition of information gain ratio: the information gain ratio of feature A with respect to the training set D is defined as the ratio of the information gain g(D|A) to the entropy H_A(D) of the data set D with respect to the values of feature A.

        Formula definition: g_R(D|A) = \frac{g(D|A)}{H_A(D)}, where H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}, n is the number of distinct values of feature A, and D_i is the subset of D on which A takes its i-th value.
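
        A minimal sketch of this ratio, assuming the same hypothetical entropy and info_gain helpers as in the earlier sketch; the name gain_ratio, the guard for zero split information, and the toy data are my own additions for illustration.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature_values, labels):
    feature_values, labels = np.asarray(feature_values), np.asarray(labels)
    h_d_a = sum(
        np.mean(feature_values == v) * entropy(labels[feature_values == v])
        for v in np.unique(feature_values)
    )
    return entropy(labels) - h_d_a

def gain_ratio(feature_values, labels):
    """g_R(D|A) = g(D|A) / H_A(D), where H_A(D) is the entropy of A's own value distribution."""
    h_a_d = entropy(feature_values)   # split information H_A(D)
    if h_a_d == 0:                    # A has a single value: splitting on it is useless
        return 0.0
    return info_gain(feature_values, labels) / h_a_d

# An ID-like feature gets maximal information gain but a much smaller gain ratio.
ids    = ["a", "b", "c", "d", "e", "f"]
labels = ["no", "no", "yes", "yes", "yes", "yes"]
print(round(info_gain(ids, labels), 3), round(gain_ratio(ids, labels), 3))   # -> 0.918 0.355
```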

        As for why the information gain ratio corrects the bias of information gain toward features with many values: when a feature has many values, the split produces many child nodes, and when the data set is not very large, each child node receives only a small amount of data. The conditions of the law of large numbers are then satisfied even more poorly, so the child nodes cannot reflect the distribution of the overall data set, yet the measured uncertainty is driven down. In the extreme case, if the feature has N values and the data set has exactly N samples, every child node holds a single sample, the entropy of every child node is 0, the conditional entropy sums to 0, and the information gain is therefore the largest possible, even though such a split generalizes poorly; a short worked calculation of this case is given below. When the data set is sufficiently large, however, this problem does not arise.
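
        To make that extreme case concrete, here is a short worked calculation (my own illustration), where feature A takes N distinct values on a data set D of exactly N samples:

```latex
% Every subset D_i contains exactly one sample, so H(D_i) = 0.
H(D|A) = \sum_{i=1}^{N} \frac{|D_i|}{|D|} H(D_i) = \sum_{i=1}^{N} \frac{1}{N} \cdot 0 = 0
\quad\Rightarrow\quad g(D|A) = H(D) - 0 = H(D) \text{ (maximal)}.

% The split information, however, is also maximal:
H_A(D) = -\sum_{i=1}^{N} \frac{1}{N} \log_2 \frac{1}{N} = \log_2 N
\quad\Rightarrow\quad g_R(D|A) = \frac{H(D)}{\log_2 N}.
```

        Since H(D) is bounded by \log_2 K for K classes while \log_2 N keeps growing with N, the ratio shrinks, which is exactly the penalty that corrects the bias.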

3. Gini Index

        The CART tree comes in two forms, a regression tree and a classification tree. When a node of the CART classification tree selects a feature to split on, the criterion used is the Gini index.

        In a classification problem, suppose there are K classes and the probability that a sample point belongs to the k-th class is p_k. The Gini index of the probability distribution is then defined as:

        Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2.

        The Gini index Gini(D) represents the uncertainty of the set D, and Gini(D, A) represents the uncertainty of D after it is split according to whether feature A takes the value a, i.e. Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2), where D_1 and D_2 are the two parts of that split. The larger the Gini index, the greater the uncertainty of the sample set, which is analogous to entropy. The CART classification tree therefore chooses the split with the smallest Gini(D, A).
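
        A minimal sketch of the Gini-based split choice, assuming a CART-style binary split into the samples with A = a and those with A != a; the names gini and gini_index and the toy data are hypothetical and used only for illustration.

```python
import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2 for a label array."""
    labels = np.asarray(labels)
    if labels.size == 0:                 # an empty branch contributes nothing
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_index(feature_values, labels, a):
    """Gini(D, A): weighted Gini of the two parts produced by the split A = a vs. A != a."""
    feature_values, labels = np.asarray(feature_values), np.asarray(labels)
    mask = feature_values == a
    return mask.mean() * gini(labels[mask]) + (~mask).mean() * gini(labels[~mask])

# CART classification: pick the split value with the smallest Gini(D, A).
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
labels  = ["no",    "no",    "yes",  "yes",  "yes",      "yes"]
best_a = min(np.unique(outlook), key=lambda a: gini_index(outlook, labels, a))
print(best_a, round(gini_index(outlook, labels, best_a), 3))   # -> sunny 0.0
```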

        
