Data Mining - Decision Tree Classification

Decision tree classification is a classification algorithm in data mining. As the name suggests, it makes decisions based on a "tree" structure, which mirrors a very natural human mechanism for handling decision problems. For example, the figure below shows a simple decision tree for deciding whether to buy a computer:

[Figure: a simple decision tree for deciding whether to buy a computer]

[Table: the training data set, 14 samples with the attributes age, income, student and credit_rating, and the class label buys_computer]

The figure above is a training data set. Using this data set as an example, let's look at how a decision tree is generated.

 

The main task of decision tree classification is to determine the decision region of each category, or in other words the boundaries between the different categories. In the decision tree classification model, these boundaries are represented by a tree structure.

 

 Through the above analysis, we can draw the following points:

  • The maximum height of the tree = the number of decision attributes
  • The shorter the tree, the better
  • Important attributes should be placed near the root of the tree

Therefore, building a decision tree is essentially a process of selecting roots.


The first step: select an attribute as the root.

The most popular attribute selection method is information gain.

The attribute with the largest information gain is considered the best root.

Before selecting an attribute, let's first understand some concepts. What is entropy? What is information? How do we measure them?

The following article explains these concepts in an easy-to-understand way:

http://www.360doc.com/content/19/0610/07/39482793_841453815.shtml

Entropy indicates the amount of uncertainty.

Information is used to eliminate uncertainty.

In fact, given a training set S, information gain represents the difference between the information needed to determine the category of any sample of S (that is, to eliminate the uncertainty) without considering any input variable, and the information needed after some input variable X is taken into account. The larger the difference, the more uncertainty X eliminates and the greater its role in classification, so X is called a good splitting variable. In other words, to determine the category of any sample of S we want to need as little information as possible, and introducing an input variable X can reduce the information needed, so we say X brings information gain to the classification task. The larger the information gain, the more important the input variable X is, and the more it should be preferred as a splitting variable.

Therefore, the general idea for calculating information gain is:

1) First, calculate the entropy Entropy(S) needed to determine which category any sample of S belongs to, without considering any input variable;

2) Then, for each input variable X, calculate the entropy Entropy(X, S) needed to determine the category of any sample of S once X is taken into account;

3) Finally, calculate the difference Entropy(S) - Entropy(X, S). This is the information (gain) that variable X brings, denoted Gain(X, S).
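
To make these three steps concrete, here is a minimal Python sketch. The helper names entropy and information_gain are mine, not from the original post; rows is assumed to be a list of attribute-to-value dicts and labels the corresponding categories.

    import math
    from collections import Counter

    def entropy(labels):
        # Entropy(S): bits needed to determine the category of a sample of S
        total = len(labels)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(labels).values())

    def information_gain(rows, labels, attribute):
        # Gain(X, S) = Entropy(S) - Entropy(X, S)
        total = len(labels)
        subsets = {}
        for row, label in zip(rows, labels):
            subsets.setdefault(row[attribute], []).append(label)
        # Entropy(X, S): weighted average entropy of the subsets
        # produced by splitting on attribute X
        conditional = sum(len(sub) / total * entropy(sub)
                          for sub in subsets.values())
        return entropy(labels) - conditional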

Combining the article's explanation of entropy above, we arrive at the formula for entropy:

Entropy(S) = - Σ_{i=1..m} p_i log2(p_i)

where m is the number of categories and p_i is the proportion of samples of S in category i. For two categories with p positive and n negative samples, this becomes:

E(p, n) = - (p / (p + n)) log2(p / (p + n)) - (n / (p + n)) log2(n / (p + n))
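
For example, a fair coin toss is maximally uncertain: E(1, 1) = - (1/2) log2(1/2) - (1/2) log2(1/2) = 1 bit, while a sample set that belongs entirely to one category has entropy 0, i.e. no uncertainty at all.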

The illustration below gives a vivid picture of what entropy represents.

[Figure: an intuitive illustration of entropy]

Now let's return to the data set above and analyze how information gain should actually be computed.

 

Following the discussion above, we first use the formula to calculate the entropy needed to determine which category any sample of the training set S belongs to, without considering any input attribute.

In this example the target attribute is buys_computer, which has two distinct values, yes and no, so there are two categories (m = 2). Let P correspond to the samples with buys_computer = yes and N to those with buys_computer = no; there are 9 samples in P and 5 samples in N. The total entropy is therefore:

E(9, 5) = - (9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

That is, E(p, n) = E(9, 5) = 0.940.
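
Using the entropy helper sketched earlier, this value is easy to verify:

    entropy(['yes'] * 9 + ['no'] * 5)   # -> 0.9403, i.e. the 0.940 above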

Next we compute the entropy of the attribute age. age takes three values, with 5, 4 and 5 samples respectively, so the entropy of age is:

E(age) = (5/14) E(2, 3) + (4/14) E(4, 0) + (5/14) E(3, 2) = 0.694

Finally, we can compute the information gain of the attribute age:

Gain(age) = E(9, 5) - E(age) = 0.940 - 0.694 = 0.246

Similarly, we can compute the information gains of income, student and credit_rating:

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

Finally, we can see that the attribute with the maximum information gain is age, so age should be used as the root attribute.
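
As a sanity check, the helper functions sketched earlier can reproduce these numbers. The 14 rows below are the well-known textbook buys_computer data set, which matches the counts quoted in this post (9 yes / 5 no, and 5/4/5 samples across the three age values); I am assuming the table image showed the same data.

    attributes = ['age', 'income', 'student', 'credit_rating']
    data = [
        ('youth',       'high',   'no',  'fair',      'no'),
        ('youth',       'high',   'no',  'excellent', 'no'),
        ('middle_aged', 'high',   'no',  'fair',      'yes'),
        ('senior',      'medium', 'no',  'fair',      'yes'),
        ('senior',      'low',    'yes', 'fair',      'yes'),
        ('senior',      'low',    'yes', 'excellent', 'no'),
        ('middle_aged', 'low',    'yes', 'excellent', 'yes'),
        ('youth',       'medium', 'no',  'fair',      'no'),
        ('youth',       'low',    'yes', 'fair',      'yes'),
        ('senior',      'medium', 'yes', 'fair',      'yes'),
        ('youth',       'medium', 'yes', 'excellent', 'yes'),
        ('middle_aged', 'medium', 'no',  'excellent', 'yes'),
        ('middle_aged', 'high',   'yes', 'fair',      'yes'),
        ('senior',      'medium', 'no',  'excellent', 'no'),
    ]
    rows = [dict(zip(attributes, r[:4])) for r in data]
    labels = [r[4] for r in data]
    for attr in attributes:
        print(f"{attr}: {information_gain(rows, labels, attr):.3f}")
    # age: 0.247, income: 0.029, student: 0.152, credit_rating: 0.048
    # (the 0.246 and 0.151 above are the same values truncated, not rounded)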

 

Once the root is determined, the next step is simply to repeat the same procedure within each subtree to decide which attribute becomes the root of the next node, until a complete decision tree is obtained.
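
The recursion itself can be sketched in a few lines on top of the helpers above; this is a bare-bones ID3-style construction, not the exact code behind this post.

    def build_tree(rows, labels, attributes):
        # A pure node, or one with no attributes left, becomes a leaf
        # labelled with the majority category
        if len(set(labels)) == 1 or not attributes:
            return Counter(labels).most_common(1)[0][0]
        # Choose the attribute with the largest information gain as the root
        best = max(attributes, key=lambda a: information_gain(rows, labels, a))
        remaining = [a for a in attributes if a != best]
        tree = {best: {}}
        for value in set(row[best] for row in rows):
            pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
            tree[best][value] = build_tree([r for r, _ in pairs],
                                           [l for _, l in pairs],
                                           remaining)
        return tree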

Although the decision tree classification algorithm can make predictions quickly, it can suffer from overfitting.

The generated tree may fit the training set too closely: it grows many branches, some of which reflect special cases that occur only a few times, are unrepresentative, or even appear only in the training set, and this lowers the accuracy of the model.

Overfitting is usually overcome by pruning, and there are two pruning methods:

Pre-pruning: prune while the tree is being constructed, by not building branches that fail to meet certain conditions.

Post-pruning: prune the complete tree after it has been generated.
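
For reference, both strategies map directly onto parameters of common libraries. A minimal scikit-learn sketch (assuming scikit-learn is available; the threshold values are illustrative, not tuned):

    from sklearn.tree import DecisionTreeClassifier

    # Pre-pruning: constrain construction so unpromising branches never grow
    pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)

    # Post-pruning: grow the full tree, then cut it back with
    # cost-complexity pruning (a larger ccp_alpha prunes more)
    post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)

Either classifier is then trained the usual way with fit(X, y).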



Source: www.cnblogs.com/hupc/p/11831307.html