Machine Learning - Decision Tree 1 (Three Algorithms)

Here we go... I'm still a bit uneasy,
because this topic involves entropy... entropy can be made simple or complicated,
so how do I explain it in plain language?
I'm not confident, so let's write and think as we go.

1. Decision tree classification idea

First of all, the idea of a decision tree is a bit like the KD tree in KNN.

The KD tree in KNN partitions all the data by one feature at each step.

A decision tree also partitions all the data by one feature at each step.

The difference is:

  • The KD tree in KNN is a binary tree. A split on a feature may not separate the data cleanly, and the same feature may be used again in a later split.

  • A decision tree can be a multi-way tree, and each split on a feature should separate the data as cleanly and completely as possible.

When does the decision tree stop dividing?

Splitting stops when all the attributes have been used up, or when the class at a node is uniquely determined (other stopping conditions can also be used).

Once the tree is built, classifying an object just means following its feature values from the top down until we reach the corresponding category.

Going by the diagram step by step,
in an ideal situation the classification can be 100% correct.

If only the world were so simple, who doesn’t like a black and white world?
I don’t like it...the black and white world is too cruel

In reality, the data is often not split that cleanly.

The intuitive approach: if a group contains both men and women, just decide by the proportion of men and women.

  • If the proportion of males > females, classify it as male; otherwise classify it as female

What a naive idea of classification!

Unfortunately there is a trap: filtering features layer by layer may let some irrelevant feature distort the class proportions.

For example, hair length has nothing to do with gender.

But if we split on the hair-length feature first, we might end up with only two short-haired people (one male and one female), or a short-haired subgroup in which males and females with an Adam's apple happen to appear in exactly equal proportion.

In that case, splitting on the hair feature interferes with predicting the category!!!

This instinctively reminds me of Naive Bayes.
Naive Bayes deals with this kind of problem too: it assumes conditional independence, works purely with probabilities, and thereby avoids the interference of irrelevant features.
Naive Bayes could in fact be used here, but what we want to understand now is the decision tree!!!

To reduce the impact of such irrelevant or weakly relevant features on the classification results, the decision tree's idea is to choose the feature to split on according to the purity of the resulting classification.

That is, the better a feature separates the data, the purer the resulting classes and the higher the certainty of the classification.

What counts as a purer, more certain classification result?

This requires the introduction of informatics concepts such as entropy, information gain, and information gain rate.

This is where I got really confused, especially by the explanations of entropy.
Opinions differ; I read a lot and was still confused.
What bothered me is that everyone gives an intuitive explanation, using everyday metaphors to tell you that entropy is such-and-such a thing.
It's like pointing at a red apple and telling me that red is like a red apple, red is like a red flag, so red is this color...
but I still don't understand: why is it like this?

1.1 Amount of information

First of all, before talking about entropy, it is necessary to understand the amount of information.

The amount of information has also troubled me for a long time. Why is the relationship between it and probability like this:

Amount of information: $N = \log_2\left(\frac{1}{P}\right)$

It is said that the information is stored in binary, ok, I can understand why the log function with base 2 is used
but why is it $\frac{1}{P}$?
So I asked: what does the reciprocal of the probability mean?
The Bernoulli trial told me that the reciprocal of the probability is the number of trials needed for that outcome to occur.
For example, if the probability of winning the 100,000 grand prize is $\frac{1}{100}$, then it takes about 100 draws to win once,
but... what does that have to do with the amount of information...
There are plenty more explanations ("that's just the definition"), and Bilibili has real-life examples too,
but something still felt missing; my understanding wasn't deep or transparent.

I didn't really get it until I saw an explanation from a Zhihu blogger:
a more fundamental explanation of the amount of information and entropy.

The amount of information is the number of binary bits needed to store which outcome occurred.

For example, to store 4 possible outcomes, we can encode them in binary:
00: the first possibility
01: the second possibility
10: the third possibility
11: the fourth possibility

Therefore, only 2 bits of binary numbers are needed to represent 4 possible results.

In fact, $n$ binary bits can represent $2^n$ possible outcomes; this is very basic knowledge.
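Here is a minimal sketch of that relationship in plain Python (just the standard library), nothing beyond the basic fact above:

```python
import math

# n binary bits can distinguish 2**n equally likely outcomes,
# so distinguishing k outcomes needs log2(k) bits
# (rounded up when k is not a power of two).
for k in (2, 4, 8, 16):
    print(f"{k} outcomes -> {math.log2(k):.0f} bit(s)")

# 4 outcomes -> 2 bits, matching the 00 / 01 / 10 / 11 encoding above
```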

Assuming these 4 possible outcomes are equally likely, the probability of each outcome is $\frac{1}{4}$.

Recall that the reciprocal of the probability is the number of trials needed, on average, for an event to occur for the first time.

Number of trials required: $n = \frac{1}{P}$

Then, for the first possible outcome to occur, $n = \frac{1}{P} = 4$ trials are needed, i.e. 4 events (in the ideal case).

For example, suppose the probability that a baby has a birthmark on its butt is 1/4.
Then, to find a butt with a birthmark, we ideally need to check 4 babies:
The first butt: no birthmark
The second butt: no birthmark
The third butt: no birthmark
The fourth butt: a birthmark (not guaranteed, but ideally it is there; what goes around comes around)

Similarly, for the second possible outcome to occur, $n = \frac{1}{P} = 4$ trials are needed, i.e. 4 events (in the ideal case).

The same is true for the third and fourth results.

Therefore: to obtain a definite outcome (whichever outcome it is), how many trials are needed on average? [the key word is average]

That is, take the weighted average of the number of trials over all possible outcomes:
$\sum_i P_i \cdot n_i=\sum_i P_i \cdot \frac{1}{P_i}$

$\frac{1}{4}\cdot 4+\frac{1}{4}\cdot 4+\frac{1}{4}\cdot 4+\frac{1}{4}\cdot 4 = 4$

So, to get a definitive result, n=4 experiments are needed on average (whatever the outcome)

But these n=4 experiments are not the amount of information we are talking about.

If we store which of these $n = 4$ possibilities occurred in binary, then $\log_2 n = \log_2 4 = 2$: 2 binary bits are needed, and this is the amount of information.

Therefore, with $n$ possibilities, storing which one occurred requires $\log_2 n$ binary bits,
i.e.: $N = \log_2 n$

The amount of binary storage that records the outcome of the trials is exactly the amount of information $N$ we are after.

That makes sense: where there is a result, there is information;
an experiment without a result is just an event, with no information.

Therefore, the amount of information of the first possible outcome is $N =\log_2(n)= \log_2\left(\frac{1}{P}\right)= \log_2 4 = 2$

The amount of information of the second, third, and fourth possible results is also calculated in this way, each of which is 2
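A tiny, hedged sketch of the formula $N = \log_2(1/P)$, nothing more than the definition above:

```python
import math

def information_content(p: float) -> float:
    """Amount of information (in bits) of an outcome with probability p: N = log2(1/p)."""
    return math.log2(1.0 / p)

# Each of the four equally likely outcomes (p = 1/4) carries 2 bits:
print(information_content(1 / 4))    # 2.0
# A rarer outcome carries more information:
print(information_content(1 / 100))  # ~6.64
```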

1.2 Information entropy

So, how much information is needed on average to clarify the outcome (whatever it is)?

This requires a weighted average.
$\sum_i P_i \cdot N_i=\frac{1}{4}\log_2 4+\frac{1}{4}\log_2 4+\frac{1}{4}\log_2 4+\frac{1}{4}\log_2 4=2$

Therefore, to clarify the result, the average amount of information required is actually what we call the entropy H!
$H = \sum_i P_i \cdot N_i =\sum_i P_i\log_2\left(\frac{1}{P_i}\right)$

I suddenly realized it,
and I don’t know if I realized it wrong. . .

So why does everyone say that a system with greater entropy has higher uncertainty?

First, we now know that entropy is the average amount of information needed to specify a result.

The greater the entropy, the greater the average amount of information required.

According to the relationship between the amount of information and probability, $N = \log_2\left(\frac{1}{P}\right)$: when the entropy is larger, the average $N$ is larger,
and since $\log_2$ is an increasing function, a larger $N$ means a larger $\frac{1}{P}$, i.e. a smaller $P$.

So: larger entropy → larger average amount of information $N$ → smaller average probability $P$ of determining the outcome.

The smaller this average probability $P$ of the determined outcome, the more uncertain the result.
[Think of it as the average probability with which the system can be sure of a particular outcome; it is not the same as the probability of drawing a particular ball.]

As an example, compare two lottery boxes

  • Lottery box No. 1: balls in 10 colors (red, orange, yellow, green, cyan, blue, purple, gold, pink, silver), 10 balls of each color, so each color is drawn with probability 1/10. Entropy:
    $H_1 = \sum_{i=1}^{10}\left(\frac{1}{10}\log_2 10\right)=\log_2 10\approx 3.32193$
  • Lottery box No. 2: 99 white balls and 1 red ball; the probability of a white ball is $\frac{99}{100}$, the probability of the red ball is $\frac{1}{100}$.
    Entropy $H_2$
    $= \frac{1}{100}\log_2 100+\frac{99}{100}\log_2\frac{100}{99}$
    $= \frac{1}{100}\log_2 100+\frac{99}{100}\log_2 100-\frac{99}{100}\log_2 99$
    $= \log_2 100-\frac{99}{100}\log_2 99\approx 0.0808$
    Comparing the entropies of the two boxes, $H_2 < H_1$,
    which means the outcome of box No. 1 is determined with relatively low probability, while the outcome of box No. 2 is relatively certain.
    Intuition agrees: draw one ball at random from box No. 2 and it is very likely white, so the result is fairly certain,
    while a random draw from box No. 1 could be any of the ten colors, so the uncertainty is much greater. (The sketch below checks both numbers.)
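A minimal sketch that checks the two lottery-box entropies numerically (the numbers are just the ones from this example):

```python
import math

def entropy(probs):
    """Shannon entropy H = sum(p * log2(1/p)) over the outcome probabilities."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

h1 = entropy([1 / 10] * 10)        # box No. 1: 10 equally likely colours
h2 = entropy([99 / 100, 1 / 100])  # box No. 2: 99 white balls, 1 red ball

print(round(h1, 5))  # 3.32193
print(round(h2, 5))  # 0.08079
```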

Therefore, given the relationships among entropy, the amount of information $N$, the number of trials $n$, and the event probability $P$, the key point is:
entropy is the average amount of information required to determine the outcome (whichever outcome it turns out to be).

  • The greater the entropy, the less certain the outcome of the event (the less sure we can be about which class it is)
  • The smaller the entropy, the more certain the outcome (it is clearer which result it is, and the classification is more confident)

The information entropy here measures how mixed up the system's classification is, based purely on the classification results.

That is, we only look at the class labels, without considering the features at all.

It's like judging how good a class of students is by looking only at their exam results (only the outcome),
rather than by the class's characteristics such as teachers, students, and educational resources (ignore the process).

But don't forget that a decision tree classifies by features, so the mixedness of the class labels by itself does not tell us how certain the classification will be once we split on a feature.

So always remember: a decision tree splits on features, and the question is which feature to split on so that the overall classification becomes more certain (that is, which feature reduces the uncertainty of the classification results the most).

For now, let Y be the class label; judging the uncertainty of the classification from the class labels alone means computing the entropy of Y.

$H(Y) = \sum_i P(y_i)\log_2\frac{1}{P(y_i)}$ (note: $y_i$ here denotes the i-th category)

This H(Y) indicates the overall classification uncertainty of the current system.
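As a sketch, H(Y) can be computed straight from a list of class labels; the 60% male / 40% female data below is a made-up example:

```python
from collections import Counter
import math

def label_entropy(labels):
    """H(Y): entropy of the class-label distribution, ignoring all features."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

# Hypothetical data: 6 males and 4 females
y = ["male"] * 6 + ["female"] * 4
print(round(label_entropy(y), 4))  # 0.971
```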

So, splitting on which feature will reduce the uncertainty of the overall system?
And how is this reduction in uncertainty measured and calculated?

First of all, let’s understand it in common life situations:
Assuming that we currently only have gender classification data, males account for 60% and females account for 40%.
Then, judging an object's category from these class proportions alone, the classification uncertainty is
$H(Y) = P_{\text{male}}\log_2\left(\frac{1}{P_{\text{male}}}\right)+P_{\text{female}}\log_2\left(\frac{1}{P_{\text{female}}}\right)$
But if we classify these classification results again according to a certain feature, can it reduce the uncertainty of classification?
Yes!
If we classify and count these data according to whether there is an Adam's apple:

  • Proportion of men with Adam's apple: P (male | with Adam's apple)
  • Proportion of men without Adam's apple: P(Male|No Adam's apple)
  • Proportion of women with Adam's apple: P (female | with Adam's apple)
  • Proportion of women without Adam’s apple: P(female|no Adam’s apple)
    We will find that after grouping the data by Adam's apple, the classes become much purer: those with an Adam's apple are basically all men, and those without are basically all women.
    So splitting on the Adam's-apple feature reduces the uncertainty of the classification results.

How is this reduction in uncertainty measured and calculated?

This is about [conditional entropy]

1.3 Conditional entropy

First of all, suppose we first classify and count the data according to the feature of whether there is an Adam's apple:

  • Proportion of men with Adam's apple: P (male | with Adam's apple)
  • Proportion of men without Adam's apple: P(Male|No Adam's apple)
  • Proportion of women with Adam's apple: P (female | with Adam's apple)
  • Proportion of women without Adam's apple: P(female|No Adam's apple)

After splitting on the Adam's-apple feature, the average amount of information required to determine the classification result, i.e. the entropy, is what we call the conditional entropy H(Y|X).
This definition, heh, is still a bit off; never mind, we'll adjust it later.

This conditional entropy is actually calculated based on the conditional probability.
The information entropy formula: $H = \sum_i P_i \cdot N_i = \sum_i P_i\log_2\left(\frac{1}{P_i}\right)$
A (tentative) conditional entropy formula: $H(y|X) = \sum_i P_{y|x_i}\cdot N_{y|x_i} = \sum_i P_{y|x_i}\log_2\left(\frac{1}{P_{y|x_i}}\right)$
This formula is actually flawed, but let's ignore that for now and push on with the conventional line of reasoning.
Note that $x_i$ here is not the i-th feature but the i-th value of a particular feature $X$.

For example, for the Adam's-apple feature, $x_0$ can stand for "no Adam's apple" and $x_1$ for "has an Adam's apple" (or the other way around);
the point is just what $x_i$ means.

The $y$ in $H(y|X)$ represents only a single class, while the actual class variable $Y$ may take several values:
$Y: y_0, y_1, y_2, \dots$

Therefore, the complete conditional entropy should be
$H(Y|X)=H(y_0|X)+H(y_1|X)+H(y_2|X)+\dots+H(y_m|X)$

where the feature $X$ has $n$ possible values and $Y$ has $m$ categories:
$H(y_0|X) = \sum_{i=1}^{n}P_{y_0|x_i}\log_2\left(\frac{1}{P_{y_0|x_i}}\right)$
$H(y_1|X) = \sum_{i=1}^{n}P_{y_1|x_i}\log_2\left(\frac{1}{P_{y_1|x_i}}\right)$
…
$H(y_m|X) = \sum_{i=1}^{n}P_{y_m|x_i}\log_2\left(\frac{1}{P_{y_m|x_i}}\right)$

Combining everything:
$H(Y|X) = \sum_{j=1}^{m}\sum_{i=1}^{n}P_{y_j|x_i}\log_2\left(\frac{1}{P_{y_j|x_i}}\right)$

It seems reasonable,
but compared against the book... oh~~~ it is very wrong!!!!
I searched online for an explanation and found a Zhihu blogger's accessible explanation of conditional entropy.
Sure enough, my derivation above really was wrong... haha

Let's re-understand it. Conditional entropy: the expectation (average) of the entropy of the conditional probability distribution of Y.

The way the textbook phrases it really kills any desire to dig deeper.

Let's take it step by step. First, what is the conditional probability distribution of Y?
It is just the conditional proportions we tallied at the beginning:

  • Proportion of men with Adam's apple: P (male | with Adam's apple)
  • Proportion of men without Adam's apple: P(Male|No Adam's apple)
  • Proportion of women with Adam's apple: P (female | with Adam's apple)
  • Proportion of women without Adam's apple: P(female|No Adam's apple)

So what is the entropy of the conditional probability distribution of Y?

It is the entropy computed within each group after splitting on the feature:

$H(Y|\text{has Adam's apple}) = P_{(\text{male}|\text{has Adam's apple})}\log_2\left(\frac{1}{P_{(\text{male}|\text{has Adam's apple})}}\right)+P_{(\text{female}|\text{has Adam's apple})}\log_2\left(\frac{1}{P_{(\text{female}|\text{has Adam's apple})}}\right)$

$H(Y|\text{no Adam's apple}) = P_{(\text{male}|\text{no Adam's apple})}\log_2\left(\frac{1}{P_{(\text{male}|\text{no Adam's apple})}}\right)+P_{(\text{female}|\text{no Adam's apple})}\log_2\left(\frac{1}{P_{(\text{female}|\text{no Adam's apple})}}\right)$

These are the entropies of the classification results under the different feature values after splitting on the feature.

Splitting on this feature is like dividing one large group into two subgroups.

Determining the classification result within each subgroup requires a certain average amount of information, which is exactly the entropy of the conditional probability distribution.

  • Objects with an Adam's apple form one subgroup; the average information needed to determine the class within this group is the entropy $H(Y|\text{has Adam's apple})$
  • Objects without an Adam's apple form the other subgroup; the average information needed to determine the class within this group is the entropy $H(Y|\text{no Adam's apple})$

When we classify using the Adam's-apple feature, the average amount of information required overall is the weighted average of these conditional-distribution entropies:

$P(\text{has Adam's apple})\,H(Y|\text{has Adam's apple})+P(\text{no Adam's apple})\,H(Y|\text{no Adam's apple})$

This is exactly the conditional entropy!!! Now I finally get it...

So the formula of conditional entropy is
$H(Y|X) = P(\text{has Adam's apple})\,H(Y|\text{has Adam's apple}) + P(\text{no Adam's apple})\,H(Y|\text{no Adam's apple})$

$P(\text{has Adam's apple})\,H(Y|\text{has Adam's apple})$
$= P(\text{has Adam's apple})\left[P_{(\text{male}|\text{has Adam's apple})}\log_2\frac{1}{P_{(\text{male}|\text{has Adam's apple})}} + P_{(\text{female}|\text{has Adam's apple})}\log_2\frac{1}{P_{(\text{female}|\text{has Adam's apple})}}\right]$

$P(\text{no Adam's apple})\,H(Y|\text{no Adam's apple})$
$= P(\text{no Adam's apple})\left[P_{(\text{male}|\text{no Adam's apple})}\log_2\frac{1}{P_{(\text{male}|\text{no Adam's apple})}} + P_{(\text{female}|\text{no Adam's apple})}\log_2\frac{1}{P_{(\text{female}|\text{no Adam's apple})}}\right]$

In practice, this means first computing the entropy of the conditional distribution under each feature value, as follows:
$H(Y|x_1)=\sum_{i=1}^{m}P_{(y_i|x_1)}\log_2\left(\frac{1}{P_{(y_i|x_1)}}\right)$
$H(Y|x_2)=\sum_{i=1}^{m}P_{(y_i|x_2)}\log_2\left(\frac{1}{P_{(y_i|x_2)}}\right)$
…
$H(Y|x_n)=\sum_{i=1}^{m}P_{(y_i|x_n)}\log_2\left(\frac{1}{P_{(y_i|x_n)}}\right)$
and then taking the weighted average of these entropies, which is exactly the conditional entropy:
$H(Y|X)=P(x_1)H(Y|x_1)+P(x_2)H(Y|x_2)+\dots+P(x_n)H(Y|x_n)$

$=\sum_{j=1}^{n}P(x_j)\sum_{i=1}^{m}P_{(y_i|x_j)}\log_2\left(\frac{1}{P_{(y_i|x_j)}}\right)$

That is the derivation and meaning of conditional entropy.

Conditional entropy is the weighted average of the information entropies of the subgroups obtained by splitting on a certain feature.
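Here is a minimal sketch of this definition. The tiny Adam's-apple dataset and its 1/0 encoding are invented purely for illustration:

```python
from collections import Counter, defaultdict
import math

def entropy(labels):
    """H(Y) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def conditional_entropy(feature, labels):
    """H(Y|X): weighted average of H(Y) over the subgroups induced by feature X."""
    groups = defaultdict(list)
    for x, y in zip(feature, labels):
        groups[x].append(y)
    n = len(labels)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

# Hypothetical data: Adam's apple (1 = yes, 0 = no) vs. gender
adams_apple = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
gender      = ["m", "m", "m", "m", "m", "f", "f", "f", "f", "m"]

print(round(entropy(gender), 3))                          # H(Y)   ~= 0.971
print(round(conditional_entropy(adams_apple, gender), 3)) # H(Y|X) ~= 0.361, much smaller
```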

Logically, when the decision tree selects a feature to split a group, it should choose the feature that minimizes the conditional entropy: first
compute the conditional entropy for every candidate feature, then split on the feature with the smallest conditional entropy,
so that the certainty of the classification results is as high as possible.

But for some reason, information gain is introduced as well.

1.4 Information Gain

Information gain measures how much the certainty of the classification increases after splitting the group on a feature, i.e. how much uncertainty is removed.

$Gain = H(Y)-H(Y|X)$
Information gain = total entropy − conditional entropy

When the total entropy is fixed, the smaller the conditional entropy, the larger the information gain, which means that splitting on this feature brings more information and removes more uncertainty.
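A hedged, self-contained sketch putting the pieces together; the helper functions just restate the formulas above, and the toy data (Adam's apple vs. hair length) is invented for illustration:

```python
from collections import Counter, defaultdict
import math

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def conditional_entropy(feature, labels):
    groups = defaultdict(list)
    for x, y in zip(feature, labels):
        groups[x].append(y)
    n = len(labels)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

def information_gain(feature, labels):
    """Gain = H(Y) - H(Y|X): the uncertainty removed by splitting on X."""
    return entropy(labels) - conditional_entropy(feature, labels)

gender      = ["m", "m", "m", "m", "m", "f", "f", "f", "f", "m"]
adams_apple = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # informative feature
hair_short  = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]   # irrelevant feature

print(round(information_gain(adams_apple, gender), 3))  # ~0.61: large gain
print(round(information_gain(hair_short, gender), 3))   # 0.0: no gain
```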

To be honest, I think information gain slightly muddles the logic and isn't really necessary.
It's like comparing which strategy reduces errors: we could simply compute the amount of error (the conditional entropy) under each strategy and compare them directly, smaller error meaning a better strategy, which is very simple. But
information gain works like gain = [total error with no strategy − error under this strategy], and then compares the gains across strategies: the larger the gain, the better the strategy performs (it removes more error). The logic is fine, but why is it necessary?

Information gain is just a positively oriented way of thinking: the larger the gain, the better the feature split; the smaller the gain, the worse the split.
Conditional entropy is the reverse orientation: the smaller the conditional entropy, the better the feature split; the larger it is, the worse the split.

But the logical formulation of conditional entropy... is clearly more rhyming... small-good, big-poor...
things that rhyme, naturally beautiful

But why is information gain not good enough? Why do people propose the information gain rate?

The book says that information gain is biased toward features with many distinct values; in other words, using information gain as the criterion tends to select features with more feature values.
Oh? Why does information gain favor features with more values?

This goes back to the information gain formula. The total entropy H(Y) is fixed, so a larger gain simply means a smaller conditional entropy:
$Gain = H(Y)-H(Y|X)$

So the defect of information gain is its preference for features with many values; in other words, features with many distinct values usually end up with a relatively small conditional entropy.

Oh? Why? Let's unpack the conditional entropy formula further.

$H(Y|X)=P(x_1)H(Y|x_1)+P(x_2)H(Y|x_2)+\dots+P(x_n)H(Y|x_n)$
Now let's analyze: why does H(Y|X) tend to get smaller as X takes more values?

First, imagine an extreme scenario: a feature that splits the whole population into 100 subgroups of only 2 objects each. The classification within each subgroup is then very certain: many subgroups have entropy 0, and even the impure ones have small entropy (at most 1 bit for a 2-object subgroup).

  • And since each subgroup's probability $P(x_n)$ is small, even where a subgroup's entropy is larger its weight is tiny, so the final weighted average, the conditional entropy H(Y|X), comes out relatively small.

Another feature might split the population into only two subgroups of 50 objects each; those subgroups are most likely not particularly pure, so their entropies are larger.

  • And since each subgroup's probability $P(x_n)$ is large, and the subgroup entropies are large, the final weighted average conditional entropy H(Y|X) is also relatively large.

The conclusion above comes from practical intuition, but it can also be analyzed mathematically.

In short, features with more values generally have a smaller conditional entropy, so when features are selected by information gain, the features with many values tend to get chosen.

Why shouldn't we choose features with many values to split on?

Because it easily overfits: many feature values mean many subgroups, each containing few objects, and classification based on very few objects is prone to error.

Suppose only 2 people in the whole class are 2.2 meters tall; that subgroup is too small to classify reliably from its class labels.

  • For example: suppose those two people are both girls. Now a new person who is 2.3 meters tall falls into the "above 2.2 meters" group; the two historical records there are both girls, so the 2.3-meter person is predicted to be a girl as well, which is obviously unreasonable.

Too many feature values make it easy to overfit. The sketch below shows the extreme case.
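As a self-contained illustration of this bias (restating the helpers from the earlier sketches and reusing the same toy labels), consider a feature that takes a distinct value for every object, like an ID column:

```python
from collections import Counter, defaultdict
import math

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def conditional_entropy(feature, labels):
    groups = defaultdict(list)
    for x, y in zip(feature, labels):
        groups[x].append(y)
    n = len(labels)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

# An "ID"-like feature: a distinct value for every object.
gender    = ["m", "m", "m", "m", "m", "f", "f", "f", "f", "m"]
sample_id = list(range(len(gender)))

# Every subgroup contains exactly one object, so every subgroup is perfectly pure:
print(round(conditional_entropy(sample_id, gender), 3))                    # 0.0
# ...and the information gain is maximal (it equals H(Y)),
# even though an ID is useless for predicting new objects.
print(round(entropy(gender) - conditional_entropy(sample_id, gender), 3))  # ~0.971
```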

Since information gain is distorted by features with many values, we need a way to remove or at least weaken that distortion.

How do we remove it?

We know that more feature values tend to make each subgroup's entropy smaller and each subgroup's probability smaller, so the weighted sum over subgroups shrinks: the more subgroups the split produces, the smaller the conditional entropy and the larger the information gain.

On the other hand, the more groups something is divided into, the larger the entropy of that grouping itself. The same logic holds for plain class labels: ignoring features, if the population simply has more categories, each category's probability is smaller and the overall entropy is larger.

This can be seen from the entropy formula
$H = \sum_i P_i \cdot N_i =\sum_i P_i\log_2\left(\frac{1}{P_i}\right)$

Expanded: $H = P_1\log_2\left(\frac{1}{P_1}\right)+P_2\log_2\left(\frac{1}{P_2}\right)+P_3\log_2\left(\frac{1}{P_3}\right)+\dots$

Assume every category has the same probability $P$ (equal proportions), and compare the entropy as the number of categories changes. Suppose there are 3 categories,
each with probability 1/3:

  • $H =\sum_i P_i\log_2\left(\frac{1}{P_i}\right)=3\cdot P\log_2\left(\frac{1}{P}\right)=\log_2 3$

Suppose there are 100 categories, each with probability 1/100:

  • $H =\sum_i P_i\log_2\left(\frac{1}{P_i}\right)=100\cdot P\log_2\left(\frac{1}{P}\right)=\log_2 100$

Now, instead of computing entropy over the class labels, we compute the entropy of the grouping induced by the feature values (note: this is not the conditional entropy):

$H(X) = \sum_{i=1}^{n}P_{x_i}\cdot N_i = \sum_{i=1}^{n}P_{x_i}\log_2\left(\frac{1}{P_{x_i}}\right)$

When there are more groups divided by features (n is larger), the entropy H(X) of the feature groups is also larger.
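A small sketch of H(X), sometimes called the split information: it looks only at how the feature's own values are distributed, and it grows with the number of values (the feature columns below are made up):

```python
from collections import Counter
import math

def feature_entropy(feature):
    """H(X): entropy of the feature-value distribution itself (the 'split information')."""
    n = len(feature)
    return sum((c / n) * math.log2(n / c) for c in Counter(feature).values())

# Made-up feature columns over 10 objects:
two_values  = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # 2 values  -> H(X) = 1 bit
five_values = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]   # 5 values  -> H(X) ~= 2.32 bits
id_column   = list(range(10))                  # 10 values -> H(X) ~= 3.32 bits

for col in (two_values, five_values, id_column):
    print(round(feature_entropy(col), 3))
```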

Therefore, we can divide the information gain Gain by the entropy H(X) of the feature grouping, so that the effect of having many feature values is offset to some extent.

Because both the information gain Gain and the feature-grouping entropy H(X) grow as the number of feature values grows, taking their ratio cancels out that effect to a degree.

It's like this: suppose we are comparing two people's weights and want to remove the weight gained from eating. We know that the more you eat, the more weight you gain;
we also know that the more you eat, the more you sweat.
If we can't measure how much they ate, we can measure how much they sweat,
and then divide weight by the amount of sweat to cancel out the weight gained from eating.
Hmm... that doesn't feel quite right.

Never mind.

The ratio of the information gain to the entropy of the feature grouping is the information gain rate:
$Gain\_rate = \frac{Gain}{H(X)}$
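Putting it all together, a hedged, self-contained sketch of the information gain rate (the C4.5-style gain ratio); the toy data is the same invented example as before:

```python
from collections import Counter, defaultdict
import math

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def conditional_entropy(feature, labels):
    groups = defaultdict(list)
    for x, y in zip(feature, labels):
        groups[x].append(y)
    n = len(labels)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

def gain_rate(feature, labels):
    """Gain_rate = (H(Y) - H(Y|X)) / H(X): information gain normalised by the split entropy."""
    gain = entropy(labels) - conditional_entropy(feature, labels)
    split_info = entropy(feature)          # H(X) over the feature's own values
    return gain / split_info if split_info > 0 else 0.0

gender      = ["m", "m", "m", "m", "m", "f", "f", "f", "f", "m"]
adams_apple = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
sample_id   = list(range(len(gender)))     # ID-like feature with 10 distinct values

# Plain information gain prefers the useless ID column...
print(round(entropy(gender) - conditional_entropy(adams_apple, gender), 3))  # ~0.61
print(round(entropy(gender) - conditional_entropy(sample_id, gender), 3))    # ~0.971
# ...but the gain rate penalises it heavily:
print(round(gain_rate(adams_apple, gender), 3))  # ~0.61
print(round(gain_rate(sample_id, gender), 3))    # ~0.292
```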

Let’s talk about the Gini coefficient. The Gini coefficient is the third indicator used to measure whether the classification by a certain feature is more certain.

1.5 Gini coefficient

The Gini coefficient, more precisely, should be called the Gini impurity coefficient

It is not the same thing as the Gini coefficient in economics, the indicator that measures a country's wealth gap!!!
At first, because I didn't understand it, I went off and studied the economic Gini coefficient, only to find that it isn't the same thing at all. The more I studied,
the more confused I got, to the point of developing an aversion to decision trees... the fear of difficulty strikes again.

After various content creators painstakingly shared their understanding, I found that the principle behind the Gini impurity is actually very simple.

Gini impurity is essentially the average error rate of the classification.

First of all, let's not talk about the classification by features, but directly look at the classification results.

Assume that in the total group, there are 3 A, 3 B, 6 C, a total of 12 objects, 3 categories

So what is the classification error rate for each object?
For an A object, a classification error means it is classified as B or C, so its classification error rate is the proportion of B and C, i.e. $\frac{12-3}{12}=\frac{9}{12}$.
The second A object also has error rate $\frac{9}{12}$, and so does the third.

A B object that is misclassified ends up as A or C, so its error rate is $\frac{9}{12}$ as well;
the other two B objects are the same, $\frac{9}{12}$.

For a C object, the error rate works out to $\frac{6}{12}$,
and the remaining C objects are the same, $\frac{6}{12}$.

Summing over the group, the total classification error is $3\cdot\frac{9}{12}+3\cdot\frac{9}{12}+6\cdot\frac{6}{12}$.
The average classification error rate of the group is therefore
$\frac{3\cdot\frac{9}{12}+3\cdot\frac{9}{12}+6\cdot\frac{6}{12}}{12}=\frac{3}{12}\cdot\frac{9}{12}+\frac{3}{12}\cdot\frac{9}{12}+\frac{6}{12}\cdot\frac{6}{12}$

$=\frac{3}{12}\left(1-\frac{3}{12}\right)+\frac{3}{12}\left(1-\frac{3}{12}\right)+\frac{6}{12}\left(1-\frac{6}{12}\right)$

This is exactly the formula for the Gini coefficient; abstractly:
$Gini = \sum_i P_i(1-P_i)$
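A minimal sketch of the Gini formula, checked against the 3 A / 3 B / 6 C example above:

```python
from collections import Counter

def gini(labels):
    """Gini impurity = sum over classes of p * (1 - p), i.e. the average misclassification rate."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

labels = ["A"] * 3 + ["B"] * 3 + ["C"] * 6
print(gini(labels))  # 0.625 = 3/12*9/12 + 3/12*9/12 + 6/12*6/12
```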

That was the Gini coefficient without splitting on any feature. To measure the certainty of the classification after splitting on a certain feature, we compute the Gini coefficient of each subgroup produced by that feature and take a weighted sum (analogous to conditional entropy).

First compute the Gini coefficient of each subgroup induced by the feature:

$Gini(Y|x_1)=\sum_{i=1}^{m}P_{(y_i|x_1)}\left(1-P_{(y_i|x_1)}\right)$
$Gini(Y|x_2)=\sum_{i=1}^{m}P_{(y_i|x_2)}\left(1-P_{(y_i|x_2)}\right)$
…
$Gini(Y|x_n)=\sum_{i=1}^{m}P_{(y_i|x_n)}\left(1-P_{(y_i|x_n)}\right)$

Then, weighting each subgroup's Gini coefficient by the subgroup's proportion and summing gives the final Gini coefficient:

$Gini(Y|X) = P(x_1)Gini(Y|x_1)+P(x_2)Gini(Y|x_2)+\dots+P(x_n)Gini(Y|x_n)$

This process is very similar to conditional entropy: the calculation logic is the same, only the measure is different.
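A sketch of the weighted (feature-conditional) Gini coefficient, structured exactly like the conditional-entropy sketch; the toy Adam's-apple data is again invented:

```python
from collections import Counter, defaultdict

def gini(labels):
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def gini_after_split(feature, labels):
    """Weighted sum of the subgroup Gini impurities after splitting on feature X."""
    groups = defaultdict(list)
    for x, y in zip(feature, labels):
        groups[x].append(y)
    n = len(labels)
    return sum(len(ys) / n * gini(ys) for ys in groups.values())

gender      = ["m", "m", "m", "m", "m", "f", "f", "f", "f", "m"]
adams_apple = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

print(round(gini(gender), 3))                           # 0.48 before splitting
print(round(gini_after_split(adams_apple, gender), 3))  # 0.16 after splitting: much purer
```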

But is the Gini coefficient also affected by a feature having many values?

It certainly is.

For example, with many feature values each subgroup ends up relatively pure, so each subgroup's Gini term P(1-P) is very small, and each subgroup's proportion is small as well, so the overall weighted sum, the Gini coefficient after the split, is also smaller.

Therefore, if the Gini coefficient is used as the criterion for feature selection, it has the same problem as conditional entropy: features with more values tend to be chosen, which is prone to overfitting.

So we ought to impose some penalty on the number of feature values.

That is what I took for granted, but in the textbook... only the plain Gini coefficient is used as the measure... hmm...

Perhaps at least a "Gini gain" should be used... but apparently the books don't do that.

The Gini gain I imagined: the Gini coefficient of the whole population − the Gini coefficient after splitting on the feature.

The Gini gain rate I imagined: $\frac{\text{Gini gain}}{\text{weighted average of the subgroup Gini coefficients under the feature}}$

I don't know if other books have special mentions, but I will try it with a program later.

Nice. At this point I think I have a thorough understanding of the three criteria a decision tree uses for feature selection!!!
It really wasn't easy. You do have to write it out in your own words to understand it,
and work through the examples yourself to see what's really going on.
You have to slap yourself in the face and humbly bow your ignorant head, if only to give the cervical spine some exercise.
Nice.

2. Summary

  • Information Gain: ID3 Algorithm
  • Information Gain Rate: C4.5 Algorithm
  • Gini coefficient: CART algorithm
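For reference, here is a minimal scikit-learn sketch (assuming scikit-learn is installed). Note that sklearn's DecisionTreeClassifier implements an optimized CART-style binary tree; setting criterion='entropy' only swaps the impurity measure used to score splits, it does not literally run ID3 or C4.5:

```python
# A quick sketch with scikit-learn (assumes scikit-learn is installed).
# Its DecisionTreeClassifier builds CART-style binary trees; the `criterion`
# parameter only chooses the impurity measure used to score splits.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

for criterion in ("gini", "entropy"):   # "entropy" scores splits by information gain
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=0)
    clf.fit(X, y)
    print(criterion, round(clf.score(X, y), 3))
```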

Why are they called these names? No idea... not important...


Origin blog.csdn.net/weixin_50348308/article/details/131383164