How to use a decision tree to judge whether to go on a blind date?

Hello everyone, I am Wang Laoshi. In the previous article, we briefly introduced what a decision tree is. In this one, we will see how to build a decision tree from a real problem and then use it to make decisions.

1. Important factors affecting the decision: purity and information entropy

Let's first look at a data set:

[Figure: blind-date data set of 7 samples, with attributes height, looks, economic conditions, other advantages, and whether to meet]

How do we construct a decision tree for judging whether to go on a blind date?

Following the steps for creating a decision tree, we first need to determine the root node. But the influencing factors all differ from one another, so which attribute should we choose as the root node?

Let's first understand two important concepts: purity and information entropy.

1.1. Purity

Purity describes how similar the samples in a data set are to each other; in other words, the smaller the divergence of the target variable, the higher the purity. The higher the purity, the less variability in the set.

Set 1: meet 6 times, do not meet 0 times;

Set 2: meet 4 times, do not meet 2 times;

Set 3: meet 3 times, do not meet 3 times;

In terms of purity, Set 1 > Set 2 > Set 3, because Set 1 has the least divergence and Set 3 has the greatest.

1.2. Information entropy

Information entropy is a basic concept of information theory. It describes the uncertainty of each possible event produced by an information source. In the 1940s, Claude E. Shannon borrowed the concept of entropy from thermodynamics and called the average amount of information left after redundancy is removed "information entropy". It can be understood as the uncertainty of information.

The occurrence of random discrete events is uncertain. To measure this uncertainty, Shannon, the father of information theory, introduced the concept of information entropy and gave a mathematical formula for it:
Entropy(t) = -Σi p(i|t) log2 p(i|t)

Here p(i|t) is the probability that a sample at node t belongs to class i, and log2 is the logarithm with base 2. The formula says that the greater the uncertainty, the greater the amount of information carried, and the higher the information entropy.

Using the formula, let's calculate the information entropy of the three sets above:

Set 1: meet 6 times, entropy = -(6/6 × log2(6/6)) = 0;

Set 2: meet 4 times, do not meet 2 times, entropy = -(4/6 × log2(4/6) + 2/6 × log2(2/6)) = 0.918;

Set 3: meet 3 times, do not meet 3 times, entropy = -(3/6 × log2(3/6) + 3/6 × log2(3/6)) = 1.
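To make these numbers concrete, here is a minimal Python sketch (the `entropy` helper name is our own, not something from the article) that reproduces the three values:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:  # 0 * log2(0) is treated as 0
            p = c / total
            result -= p * math.log2(p)
    return result

print(entropy([6, 0]))  # Set 1: 6 meet, 0 not meet -> 0.0
print(entropy([4, 2]))  # Set 2: 4 meet, 2 not meet -> ~0.918
print(entropy([3, 3]))  # Set 3: 3 meet, 3 not meet -> 1.0
```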

From the above results, the greater the information entropy, the lower the purity. When the classes in a set are evenly mixed, the information entropy is largest and the purity is lowest.

When we construct a decision tree, we build it based on purity. Commonly used criteria are information gain (ID3 algorithm), information gain rate (C4.5 algorithm), and the Gini index (CART algorithm).

2. ID3 algorithm

2.1. Introduction to ID3 Algorithm

The ID3 algorithm was first proposed by J. Ross Quinlan at the University of Sydney in 1975 as a classification and prediction algorithm. Its core is "information entropy". ID3 computes the information gain of each attribute and treats attributes with high information gain as good attributes. At each split it selects the attribute with the highest information gain as the splitting criterion, and repeats this process until it produces a decision tree that classifies the training samples perfectly.

ID3 is a greedy algorithm built around information gain. Information gain measures how much a split increases purity, i.e. decreases information entropy. It is computed as the information entropy of the parent node minus the weighted information entropy of its child nodes. During the calculation, each child node's entropy is normalized, that is, weighted by the probability of a sample from the parent node falling into that child node. So the formula for information gain can be expressed as:

Gain(D, a) = Ent(D) - Σi (|Di| / |D|) × Ent(Di)

In the formula, D is the parent node, Di is the i-th child node, and a in Gain(D, a) is the attribute chosen to split node D.

2.2. Using the ID3 algorithm to construct a decision tree

We calculate the information gain of each attribute according to the formula above, and select the most suitable attribute as the root node.

There are 7 samples: according to the results, 4 are "do not meet" and 3 are "meet", so the information entropy of the root node is:

Ent(D) = -(4/7 × log2(4/7) + 3/7 × log2(3/7)) = 0.985

If we split on the economic conditions attribute, there are three child nodes: rich, average, and no money. We label them D1, D2, and D3, and use + and - to denote the corresponding "meet" / "do not meet" results. The calculation for economic conditions as the splitting attribute is as follows:

Information gain of economic conditions:

D1 (economic conditions = rich) = {1-, 2-, 6+}, Ent(D1) = -(2/3 × log2(2/3) + 1/3 × log2(1/3)) = 0.918

D2 (economic conditions = average) = {3+, 7-}, Ent(D2) = -(1/2 × log2(1/2) + 1/2 × log2(1/2)) = 1

D3 (economic conditions = no money) = {4+, 5-}, Ent(D3) = -(1/2 × log2(1/2) + 1/2 × log2(1/2)) = 1

Normalized information entropy: 3/7 × 0.918 + 2/7 × 1 + 2/7 × 1 = 0.965

The information gain with economic conditions as the splitting attribute is Gain(D, economy) = 0.985 - 0.965 = 0.020.
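As a sanity check, here is a short Python sketch of the same calculation (the class counts are taken from the sets above; the `ent` helper name is ours):

```python
from math import log2

def ent(pos, neg):
    """Binary entropy from positive/negative counts."""
    out = 0.0
    for c in (pos, neg):
        if c:
            p = c / (pos + neg)
            out -= p * log2(p)
    return out

# Parent node: 3 "meet" (+) and 4 "do not meet" (-) out of 7 samples.
parent = ent(3, 4)                                  # ~0.985

# Child nodes for "economic conditions": (meet, do not meet) counts per value.
children = {"rich": (1, 2), "average": (1, 1), "no money": (1, 1)}
n = 7
weighted = sum((p + q) / n * ent(p, q) for p, q in children.values())  # ~0.965
print(round(parent - weighted, 3))                  # information gain ~0.020
```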

In the same way, the information gain of the other attributes at the root node is calculated as follows:

Information gain of height:

D1 (height = tall) = {5-}, Ent(D1) = -(1 × log2(1)) = 0

D2 (height = average) = {6+, 7-}, Ent(D2) = -(1/2 × log2(1/2) + 1/2 × log2(1/2)) = 1

D3 (height = short) = {1-, 2-, 3+, 4+}, Ent(D3) = -(2/4 × log2(2/4) + 2/4 × log2(2/4)) = 1

Normalized information entropy: 1/7 × 0 + 2/7 × 1 + 4/7 × 1 = 0.857

Information gain: Gain(D, height) = 0.985 - 0.857 = 0.128

Information gain of looks:

D1 (looks = handsome) = {3+, 4+, 5-, 7-}, Ent(D1) = -(2/4 × log2(2/4) + 2/4 × log2(2/4)) = 1

D2 (looks = not handsome) = {1-, 2-, 6+}, Ent(D2) = -(2/3 × log2(2/3) + 1/3 × log2(1/3)) = 0.918

Normalized information entropy: 4/7 × 1 + 3/7 × 0.918 = 0.965

Information gain: Gain(D, looks) = 0.985 - 0.965 = 0.020

The normalized information entropy and information gain of each attribute at the root node are summarized below:

Attribute            Normalized information entropy    Information gain
Economy              0.965                             0.020
Height               0.857                             0.128
Looks                0.965                             0.020
Other advantages     0.965                             0.020

Height gives the largest information gain, and the ID3 algorithm uses the attribute with the largest information gain as the splitting node to obtain a high-purity decision tree. So we take height as the root node, and the decision tree looks like this:

[Figure: decision tree with height as the root node]

We continue splitting downward under the height = short branch (its samples are {1-, 2-, 3+, 4+}, so its entropy is 1) and calculate the information gain of the remaining attributes:

Looks:

D1 (looks = handsome) = {3+, 4+}, Ent(D1) = 0

D2 (looks = not handsome) = {1-, 2-}, Ent(D2) = 0

Normalized information entropy: 2/4 × 0 + 2/4 × 0 = 0

Information gain: Gain(D, looks) = 1 - 0 = 1

Economic conditions:

D1 (economic conditions = rich) = {1-, 2-}, Ent(D1) = 0

D2 (economic conditions = average) = {3+}, Ent(D2) = 0

D3 (economic conditions = no money) = {4+}, Ent(D3) = 0

Normalized information entropy: 0

Information gain: Gain(D, economic conditions) = 1 - 0 = 1

Other advantages:

D1 (other advantages = yes) = {1-, 3+, 4+}, Ent(D1) = -(1/3 × log2(1/3) + 2/3 × log2(2/3)) = 0.918

D2 (other advantages = none) = {2-}, Ent(D2) = 0

Normalized information entropy: 3/4 × 0.918 + 1/4 × 0 = 0.6885

Information gain: Gain(D, other advantages) = 1 - 0.6885 = 0.3115

Attribute            Normalized information entropy    Information gain
Looks                0                                 1
Economy              0                                 1
Other advantages     0.6885                            0.3115

Looks and economic conditions both give the largest information gain, so we can choose either looks or economy as the splitting attribute for this node.

[Figure: completed decision tree]

In this way we have built a decision tree, and given the conditions the other person provides, we can use it to decide whether or not to meet.
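If you would rather let a library handle the bookkeeping, scikit-learn's DecisionTreeClassifier can build a similar tree with criterion="entropy". Note that scikit-learn implements an optimized CART-style learner rather than textbook ID3, and the categorical values below are integer-encoded by hand (our own encoding, acceptable for a toy example). The "other advantages" column is omitted because its values for samples 5 to 7 are only visible in the original figure. A minimal sketch:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Encoding (ours, for illustration): height 0=short, 1=average, 2=tall;
# looks 0=not handsome, 1=handsome; economy 0=no money, 1=average, 2=rich.
X = [
    [0, 0, 2],  # sample 1: short, not handsome, rich     -> do not meet
    [0, 0, 2],  # sample 2: short, not handsome, rich     -> do not meet
    [0, 1, 1],  # sample 3: short, handsome, average      -> meet
    [0, 1, 0],  # sample 4: short, handsome, no money     -> meet
    [2, 1, 0],  # sample 5: tall, handsome, no money      -> do not meet
    [1, 0, 2],  # sample 6: average, not handsome, rich   -> meet
    [1, 1, 1],  # sample 7: average, handsome, average    -> do not meet
]
y = [0, 0, 1, 1, 0, 1, 0]  # 1 = meet, 0 = do not meet

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)
print(export_text(clf, feature_names=["height", "looks", "economy"]))
```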

The rules of ID3 are relatively simple and highly interpretable, but it also has defects. For example, ID3 tends to select attributes with many distinct values.

One drawback of ID3 is that an attribute with little real effect on the classification task may still be selected as the "best" attribute. This does not happen every time, but it happens with some probability; in most cases ID3 still produces decent decision trees. To address these defects, a new and improved algorithm, C4.5, was later proposed.

3. C4.5 Algorithm

3.1. C4.5 algorithm ideas

1. Using information gain rate

ID3 tends to select attributes with many values when computing information gain. To avoid this problem, C4.5 selects attributes using the information gain rate instead: information gain rate = information gain / attribute entropy.
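For example, plugging in the root-level numbers for the economic-conditions attribute from section 2.2 (a rough sketch; the helper names are ours):

```python
from math import log2

def ent(counts):
    """Shannon entropy (base 2) of a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# Information gain of "economic conditions" at the root (from section 2.2).
gain = 0.985 - 0.965                      # ~0.020

# Attribute entropy (split information): entropy of the value distribution
# itself -- 3 "rich", 2 "average", 2 "no money" out of 7 samples.
split_info = ent([3, 2, 2])               # ~1.557

print(round(gain / split_info, 3))        # information gain rate ~0.013
```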

2. Use pessimistic pruning

When ID3 constructs a decision tree, it is prone to overfitting. In C4.5, pessimistic pruning (PEP) is used after the decision tree construction, which can improve the generalization ability of the decision tree.

Pessimistic pruning is one of the post-pruning techniques. It recursively estimates the classification error rate of each internal node, and compares the classification error rate of this node before and after pruning to decide whether to prune it. This pruning method no longer requires a separate test dataset.

3. Discretization of continuous attributes

C4.5 can handle continuous attributes by discretizing them. For example, the "height" attribute of a blind date may not be given as "tall / average / short" but as an actual height, so it can take any value.

C4.5 sorts the values of the continuous attribute, takes candidate thresholds between adjacent values, and selects the threshold whose partition gives the highest information gain.
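A rough sketch of this midpoint-candidate idea (the heights and labels below are made up purely for illustration):

```python
from math import log2

def ent(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    out = 0.0
    for v in set(labels):
        p = labels.count(v) / n
        out -= p * log2(p)
    return out

def best_threshold(values, labels):
    """Pick the split threshold on a continuous attribute that maximizes
    information gain, using midpoints of adjacent sorted values as candidates."""
    pairs = sorted(zip(values, labels))
    base = ent(labels)
    best = (None, -1.0)
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - (len(left) / len(pairs) * ent(left)
                       + len(right) / len(pairs) * ent(right))
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical heights (cm) and "meet" labels, just to illustrate the idea.
heights = [158, 162, 170, 172, 175, 180, 183]
meet    = [0,   0,   1,   1,   0,   1,   0]
print(best_threshold(heights, meet))  # (threshold, information gain)
```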

4. Handling missing values

3.2. How does the C4.5 algorithm handle missing values?

Let's look at the following data set. How do we select attributes when some values are missing? And once an attribute has been chosen for the split, how should a sample that is missing that attribute be assigned to a branch?

[Figure: blind-date data set in which the height of sample 1 is missing]

If we leave out the sample with the missing value, the samples with a known height are D′ = {2-, 3+, 4+, 5-, 6+, 7-}. We then calculate the information entropy of each value of the attribute:

Height = short: D1 = {2-, 3+, 4+}, information entropy = -(1/3 × log2(1/3) + 2/3 × log2(2/3)) = 0.918

Height = average: D2 = {6+, 7-}, information entropy = -(1/2 × log2(1/2) + 1/2 × log2(1/2)) = 1

Height = tall: D3 = {5-}, information entropy = -(1 × log2(1)) = 0

Information gain: Gain(D′, height) = Ent(D′) - (3/6 × 0.918 + 2/6 × 1 + 1/6 × 0) = 1 - 0.792 = 0.208

Attribute entropy: -(3/6 × log2(3/6) + 1/6 × log2(1/6) + 2/6 × log2(2/6)) = 1.459

Information gain rate: Gain_ratio(D′, height) = 0.208 / 1.459 = 0.1426

D′ contains 6 samples and D contains 7, so the weight is 6/7. Scaling by this weight gives Gain_ratio(D, height) = 6/7 × 0.1426 = 0.122.

In this way, even when some values of the height attribute are missing, we can still calculate its information gain rate and use it to select attributes.
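The same arithmetic in a short Python sketch (counts taken from the calculation above; the helper names are ours):

```python
from math import log2

def ent(counts):
    """Shannon entropy (base 2) of a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# Samples with a known height: {2-, 3+, 4+, 5-, 6+, 7-} -> 3 meet, 3 do not.
known = 6
total = 7
parent = ent([3, 3])                                   # Ent(D') = 1

# Child nodes of "height" over the known samples: (meet, do not meet) counts.
children = {"short": (2, 1), "average": (1, 1), "tall": (0, 1)}
weighted = sum(sum(c) / known * ent(c) for c in children.values())  # ~0.792
gain = parent - weighted                               # ~0.208

# Attribute (split) entropy over the known samples: 3 short, 2 average, 1 tall.
split_info = ent([3, 2, 1])                            # ~1.459
gain_ratio = gain / split_info                         # ~0.1426

# Scale by the fraction of samples whose height is not missing (6/7),
# following the weighting used in this article.
print(round(known / total * gain_ratio, 3))            # ~0.122
```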

3.3. Comparison of the C4.5 and ID3 algorithms

ID3 algorithm

Advantages:

  • The algorithm is simple and easy to understand

Disadvantages:

  • Cannot handle missing values
  • Can only handle discrete values, not continuous values
  • Using information gain as the splitting rule tends to select features with many values, because the more values a feature has, the lower the uncertainty after the split and the higher the information gain
  • Prone to overfitting

C4.5 algorithm

Advantages:

  • Can handle missing values
  • Can handle continuous values by discretizing them
  • Using the information gain ratio avoids the bias toward features with many values, because information gain ratio = information gain / attribute entropy, and the attribute entropy grows with the number of the attribute's values, which offsets the gain when dividing
  • Prunes branches (pessimistic pruning) to reduce overfitting

Disadvantages:

  • Constructing the decision tree requires multiple scans and sorts of the data, which is inefficient


Source: blog.csdn.net/b379685397/article/details/127124263