[Machine Learning] P23 Decision Tree, Entropy and Information Gain

Decision Trees, Entropy, and Information Gain


decision tree

A decision tree is a classification algorithm based on a tree structure; it decides which category a data instance belongs to through a series of queries (also called tests or decision conditions).

Using a single running example throughout this post, we try to judge whether an animal is a cat or a dog based on three features recorded in a data table:

  • Feature 1: Ear shape (pointed or folded)
  • Feature 2: Face shape (round or not round)
  • Feature 3: Whiskers (present or not)
Ear shape   Face shape   Whiskers   Is it a cat?
pointed     round        yes        yes
folded      not round    yes        yes
folded      round        no         no
pointed     not round    yes        no
pointed     round        yes        yes
pointed     round        no         yes
folded      not round    no         no
pointed     round        no         yes
folded      round        no         no
folded      round        no         no

Based on the table above, one way to construct the decision tree is as follows:

[Figure: a decision tree built from the table, with ear shape tested at the root and face shape tested on the pointed-ear branch]

Two questions naturally arise here:

  • Why does the root node test ear shape first, rather than face shape?
  • Why, on the pointed-ear branch, do we then test face shape rather than whiskers?

The answer lies in entropy and information gain.


entropy

Entropy is a measure of the degree of disorder in a data set: it quantifies how mixed the categories are. The higher the entropy, the more disordered the data set, i.e., the more the categories are mixed together. In the decision tree algorithm, we want to split the data set using as few tests as possible, so we choose the test conditions that reduce the entropy of the data set the most, i.e., those with the largest information gain.

In this case, entropy measures how mixed the cats and dogs are in a data set. If a data set contains only cats, its entropy is 0; if it contains only dogs, its entropy is also 0; but if it is half cats and half dogs, the entropy reaches its maximum value of 1.

The formula for entropy is:
H(p_1) = -p_1 \log_2(p_1) - (1 - p_1) \log_2(1 - p_1)

where p_1 is the proportion of cats in the data set, so (1 - p_1) is the proportion of dogs in the data set.

The entropy curve therefore looks like this:

[Figure: entropy H(p_1) plotted against p_1; it is 0 at p_1 = 0 and p_1 = 1, and reaches its maximum of 1 at p_1 = 0.5]

  • p_1 = 0 means the data set is all dogs
  • p_1 = 1 means the data set is all cats

According to the data table, the initial value of the entropy is:

p_0 = 5/10 = 0.5
H(p_0) = -0.5 \log_2(0.5) - (1 - 0.5) \log_2(1 - 0.5) = 1
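
As a quick numeric check of this value (assuming NumPy, which is also used in the code later in this post):

import numpy as np

# Entropy at the root node: 5 cats out of 10 examples, so p = 0.5
p = 0.5
print(-p * np.log2(p) - (1 - p) * np.log2(1 - p))   # 1.0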

In the decision tree algorithm, our next step is to reduce the entropy of the data set as much as possible, that is, to separate the cats from the dogs; the amount by which the entropy decreases at each split is the information gain.


information gain

Information gain is the reduction in the entropy of a data set after splitting it on some test condition. The larger the information gain, the more that test condition reduces the entropy of the data set, i.e., the more it increases the data set's degree of order.

So, starting from the initial state, we have three test conditions we could use to split the data set and thereby reduce its entropy, and our goal is to pick the one with the largest information gain:

  • ear shape
  • face shape
  • whiskers

[Figure: the three candidate splits (ear shape, face shape, whiskers) and the entropy of each resulting branch]
Information gain = initial entropy - result entropy

As shown in the figure above, the initial entropy is H(0.5) = 1. Splitting on ear shape yields the largest information gain, so we choose ear shape as the first test condition for dividing the original data set.
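
These gains can be verified by hand from the table (arithmetic worked out here, rounded to two decimals):

  • Ear shape: pointed → 4/5 cats, folded → 1/5 cats; result entropy = 0.5 × H(0.8) + 0.5 × H(0.2) ≈ 0.72, so the information gain ≈ 1 - 0.72 = 0.28
  • Face shape: round → 4/7 cats, not round → 1/3 cats; result entropy = 0.7 × H(4/7) + 0.3 × H(1/3) ≈ 0.97, so the information gain ≈ 0.03
  • Whiskers: yes → 3/4 cats, no → 2/6 cats; result entropy = 0.4 × H(0.75) + 0.6 × H(1/3) ≈ 0.88, so the information gain ≈ 0.12

Ear shape clearly wins.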

Each subsequent split follows the same principle: choose the condition with the maximum information gain, and keep splitting until the entropy reaches 0.

Of course, in practice there will be some noise points that prevent the entropy from ever reaching 0, so we usually set a maximum depth and construct a decision tree of finite depth.
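
As an aside, in practice one usually lets a library build such a depth-limited tree. A minimal sketch assuming scikit-learn is installed (the rest of this post instead builds the pieces by hand):

from sklearn.tree import DecisionTreeClassifier
import numpy as np

# The same table as above: [ear shape, face shape, whiskers], 1 = pointed / round / present
X = np.array([[1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1],
              [1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0]])
y = np.array([1, 1, 0, 0, 1, 1, 0, 1, 0, 0])   # 1 = cat

# criterion='entropy' makes scikit-learn pick splits by information gain;
# max_depth=2 caps the depth so noise cannot grow the tree without bound
clf = DecisionTreeClassifier(criterion='entropy', max_depth=2)
clf.fit(X, y)
print(clf.predict([[1, 1, 0]]))   # pointed ears, round face, no whiskers -> [1] (a cat)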


Python and decision trees

The data set from the table, encoded as NumPy arrays (1 = pointed ears / round face / whiskers present, 0 otherwise; labels: 1 = cat, 0 = not a cat):

import numpy as np

X_train = np.array([[1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0]])

y_train = np.array([1, 1, 0, 0, 1, 1, 0, 1, 0, 0])

Calculate entropy:

Recall the entropy formula:

H(p) = -p \log_2(p) - (1 - p) \log_2(1 - p)

In Python:

def entropy(p):
    # Entropy of a binary label set, where p is the fraction of positive (cat) examples
    if p == 0 or p == 1:
        return 0
    else:
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)
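
A quick sanity check of this function on a couple of values:

print(entropy(0.5))   # 1.0  -> a 50/50 mix of cats and dogs
print(entropy(0.8))   # about 0.72 -> e.g. 4 cats out of 5 examples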

Split the data set into two parts according to a test condition (a feature index):

def split_indices(X, index_feature):
    # Return the row indices where the given feature equals 1 (left)
    # and where it equals 0 (right)
    left_indices = []
    right_indices = []
    for i, x in enumerate(X):
        if x[index_feature] == 1:
            left_indices.append(i)
        else:
            right_indices.append(i)
    return left_indices, right_indices
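
For example, splitting on feature 0 (ear shape) separates the pointed-ear animals from the folded-ear ones:

left, right = split_indices(X_train, 0)
print(left)    # [0, 3, 4, 5, 7]  -> the five pointed-ear examples
print(right)   # [1, 2, 6, 8, 9]  -> the five folded-ear examples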

Compute the weighted entropy after the split (the result entropy):

def weighted_entropy(X, y, left_indices, right_indices):
    # Weight of each branch = fraction of all examples that fall into it
    w_left = len(left_indices) / len(X)
    w_right = len(right_indices) / len(X)
    # Fraction of cats in each branch
    p_left = sum(y[left_indices]) / len(left_indices)
    p_right = sum(y[right_indices]) / len(right_indices)

    weighted_entropy = w_left * entropy(p_left) + w_right * entropy(p_right)
    return weighted_entropy
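
Continuing with the ear-shape split from above, the result entropy comes out to roughly 0.72:

left, right = split_indices(X_train, 0)
print(weighted_entropy(X_train, y_train, left, right))   # about 0.72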

Calculate information gain:

def information_gain(X, y, left_indices, right_indices):
    # Entropy at the parent node minus the weighted entropy after the split
    p_node = sum(y) / len(y)
    h_node = entropy(p_node)
    w_entropy = weighted_entropy(X, y, left_indices, right_indices)
    return h_node - w_entropy
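
Putting the pieces together, we can compare the information gain of the three candidate features and confirm that ear shape is the best first split (the feature names below are just labels for readability):

for i, name in enumerate(['ear shape', 'face shape', 'whiskers']):
    left, right = split_indices(X_train, i)
    gain = information_gain(X_train, y_train, left, right)
    print(f'{name}: information gain = {gain:.2f}')

# Expected output:
# ear shape: information gain = 0.28
# face shape: information gain = 0.03
# whiskers: information gain = 0.12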

Origin blog.csdn.net/weixin_43098506/article/details/130306819