In-depth understanding of how a decision tree chooses its optimal classification rule

Author: Key. Blog home page: https://home.cnblogs.com/u/key1994/

This is an original work; please indicate the source when reposting, or contact: [email protected]

 

Today I studied the principles of decision tree classification. On the whole, decision trees are easier to understand than SVMs, for two reasons:

(1) The way a decision tree classifies is close to human patterns of thought, namely divide and conquer.

(2) The rules by which a decision tree is applied are fixed and of a single form, which lowers the difficulty of understanding.

However, when selecting the optimal splitting attribute, decision trees introduce the concepts of entropy (Entropy) and information gain (Information Gain), which are rather abstract to understand. This article uses a simple example to build a deeper understanding of both.

 

This article uses a simplified model: a set of labeled points on a grid. Its data are organized in the following table:

x   y   class        x   y   class
1   1   cir          3   4   tan
1   2   cir          3   5   tan
1   3   cir          3   6   tan
1   4   tan          4   1   cir
1   5   tan          4   2   tri
1   6   tan          4   3   tri
2   1   cir          4   4   tri
2   2   cir          4   5   tri
2   3   cir          4   6   tri
2   4   tan          5   1   cir
2   5   tan          5   2   tri
2   6   tan          5   3   tri
3   1   cir          5   4   tri
3   2   cir          5   5   tri
3   3   cir          5   6   tri

Obviously, when building a decision tree model there are two possible approaches:

(1) split on the x coordinate first, then on the y coordinate;

(2) split on the y coordinate first, then on the x coordinate.

As for which is the better choice, decision tree theory tells us to compute the information gain of each.

But here I have a question. Suppose we choose scheme (1), that is, we split on the x coordinate first. There are three classes among the samples, but x takes five distinct values (1, 2, 3, 4 and 5). According to decision tree theory, the split should be chosen as follows:

[figure omitted]

Yet from the distribution of the sample data, we can see that the following way of splitting would be more efficient:

[figure omitted]

However, this approach does not generalize: not every problem has a pattern that can be spotted by eye, and what we want is precisely to let the machine make the decision itself. In any case, let us set this question aside, because the subject of this article is the essence of entropy and information gain.

Next comes the core content of this article.

Comparing the two schemes presented above, it is hard to tell at a glance which is better, so let us compute their information gains.

Scheme (1):

Split on the x coordinate first.

The entropies of the respective nodes work out as follows (the calculation process is omitted; it is reconstructed below):

Entropy(parent)=1.58

Entropy(children1)=1

Entropy(children2)=0.65

IG=0.72
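
For reference, the omitted arithmetic can be reconstructed from the table. The class counts are 11 cir, 9 tan and 10 tri, and, judging from the children entropies, the split used here is x ≤ 3 versus x > 3 (18 and 12 samples); note this split is my inference, since the original figure is unavailable:

Entropy(parent) = −(11/30)·log2(11/30) − (9/30)·log2(9/30) − (10/30)·log2(10/30) ≈ 1.58

Entropy(children1) = −(9/18)·log2(9/18) − (9/18)·log2(9/18) = 1 (9 cir, 9 tan)

Entropy(children2) = −(2/12)·log2(2/12) − (10/12)·log2(10/12) ≈ 0.65 (2 cir, 10 tri)

IG = 1.58 − (18/30)·1 − (12/30)·0.65 = 0.72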

Scheme (2):

Split on the y coordinate first.

[figure omitted]

The entropies of the respective nodes work out as follows (the calculation process is again omitted):

Entropy(parent)=1.58

Entropy(children1)=0.8367

Entropy(children2)=0.971

IG=0.8067
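
These calculations can also be reproduced with a short script. Below is a minimal Python sketch; it assumes the two splits are x ≤ 3 vs x > 3 and y ≤ 3 vs y > 3, which I inferred from the children entropies because the original figures are unavailable:

import math
from collections import Counter

# The 30 labeled grid points from the table above.
data = ([(x, y, "cir") for x in (1, 2, 3) for y in (1, 2, 3)]
        + [(x, 1, "cir") for x in (4, 5)]
        + [(x, y, "tan") for x in (1, 2, 3) for y in (4, 5, 6)]
        + [(x, y, "tri") for x in (4, 5) for y in (2, 3, 4, 5, 6)])

def entropy(samples):
    # Shannon entropy (in bits) of the class labels in `samples`.
    counts = Counter(label for _, _, label in samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def info_gain(parent, left, right):
    # Parent entropy minus the size-weighted entropies of the children.
    n = len(parent)
    return (entropy(parent)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

for name, axis in (("x <= 3", 0), ("y <= 3", 1)):
    left = [s for s in data if s[axis] <= 3]
    right = [s for s in data if s[axis] > 3]
    print(name, round(entropy(left), 4), round(entropy(right), 4),
          round(info_gain(data, left, right), 4))

For the x split this prints the Entropy = 1, Entropy = 0.65 and IG = 0.72 figures quoted above.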

Judging by the information gain values, we should choose the second scheme for classification. However, if we do choose the second scheme, the accuracy of the resulting classifier will be lower than with the first (this is easy to verify; readers can work it out for themselves). That is to say, judged by classifier accuracy, the second scheme is the less effective one. Why does this phenomenon occur?

Let us analyze the principle behind it.

What is entropy? By definition, entropy characterizes the purity (Purity) of a sample set. If the samples contain only one class, purity is at its highest and the entropy computes to 0. If several classes are present, purity drops and entropy rises accordingly. In the most extreme case, where every class has the same number of samples, purity is lowest and entropy reaches its maximum: 1 for two classes, and log2(k) for k classes in general.
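
For instance, evaluating H = −Σ p_i·log2(p_i) on a few distributions (a quick Python sketch):

import math

def entropy(probs):
    # H = -sum(p * log2(p)); terms with p = 0 are taken as 0 by convention.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))            # a single class: 0.0, the highest purity
print(entropy([0.5, 0.5]))       # two balanced classes: 1.0
print(entropy([1/3, 1/3, 1/3]))  # three balanced classes: log2(3) = 1.585

The last case is also why the nearly balanced parent set above (11, 9 and 10 samples of the three classes) comes out at about 1.58.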

Information gain is the parent node's entropy minus the weighted sum of the entropies of its child nodes; in other words, it measures how much the purity changes when the parent node is divided into child nodes on the current attribute. The information gain is never negative: an individual child node may well be less pure than the parent, but after the size-weighted summation the children's entropy never exceeds the parent's. Furthermore, the larger the information gain, the more pronounced the improvement in purity. Naturally we want purity to improve as quickly as possible, so that the classifier can be trained as fast as possible.
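
In symbols, with N_i the number of samples in child i out of the parent's N:

IG = Entropy(parent) − Σ_i (N_i/N)·Entropy(child_i)

A small worked example of the point about individual children (my own illustration, not from the original): a parent of 10 samples (9 of class A, 1 of class B) has entropy ≈ 0.47. Splitting off one A together with the B leaves children with entropies 0 and 1, so the second child is less pure than the parent; but the weighted sum is (8/10)·0 + (2/10)·1 = 0.2, and the gain is still positive, about 0.27.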

Note, then, that when the splitting attribute is chosen by the size of the information gain, the main consideration is how quickly the classifier trains, not its accuracy. This explains why, in the example above, scheme two ends up less accurate than scheme one.

This looks like a bug, so why is information gain so popular nowadays? Let me hazard a guess. In practical problems a model may have many attributes, say 10, each of which can take three values; the model then becomes very complex and the amount of computation grows greatly. Real problems are likely to be far more complex than this assumption. So computation speed has to be taken into account, even if part of the accuracy must be sacrificed as the price. And this is a very common situation in machine learning: seeking a balance between the amount of computation and accuracy!

Besides, if we insisted on not misclassifying any sample point, how would we avoid overfitting?

Since decision trees have come up, here is some of my own understanding of the decision tree algorithm. If anything is inaccurate, I hope readers will point it out and correct me.

First, although decision trees can be used for regression, their main application is still classification.

Second, to obtain good classification results with a decision tree, a large amount of data is a precondition. Only when the dataset is large enough can the tree cover the variety of situations that may arise; but large amounts of data also make training more challenging.

Finally, parameter selection for a decision tree is a complex process. As demonstrated above, selecting the splitting attribute by information gain does not necessarily yield the best training result. To improve a decision tree's classification accuracy, many parameters need to be determined in advance, such as criterion (entropy / gini), splitter (random / best), max_depth (the depth of the tree), min_samples_split (the minimum number of samples required to split a node), and so on, and choosing these parameters is itself a complex process.
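
These parameter names are those of scikit-learn's DecisionTreeClassifier. As a closing illustration, here is a minimal sketch that fits a tree to the grid data from the table, assuming scikit-learn is installed; the particular parameter values are only examples to experiment with:

from sklearn.tree import DecisionTreeClassifier, export_text

# The 30 labeled grid points from the table above.
points = ([(x, y, "cir") for x in (1, 2, 3) for y in (1, 2, 3)]
          + [(x, 1, "cir") for x in (4, 5)]
          + [(x, y, "tan") for x in (1, 2, 3) for y in (4, 5, 6)]
          + [(x, y, "tri") for x in (4, 5) for y in (2, 3, 4, 5, 6)])
X = [[px, py] for px, py, _ in points]
y = [label for _, _, label in points]

clf = DecisionTreeClassifier(criterion="entropy", splitter="best",
                             max_depth=2, min_samples_split=2)
clf.fit(X, y)
print(export_text(clf, feature_names=["x", "y"]))  # the thresholds actually chosen
print("training accuracy:", clf.score(X, y))

Printing the fitted tree with export_text is a quick way to see which splits the greedy entropy criterion actually selects on this data.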


 
