Machine learning --- decision tree algorithms (CLS, ID3, C4.5, CART)

1. Decision tree

A decision tree (Decision Tree) is a tree structure used for classification. Each internal node represents a test on an attribute, each edge represents an outcome of that test, and each leaf node represents a class or a class distribution. The topmost node is the root node.

A decision tree expresses rules of the form: under these conditions, this class is obtained. The example below is a decision tree built for such a problem; from it you can see the basic components of a decision tree: decision nodes, branches, and leaf nodes.

The figure below gives an example of a decision tree used in business. It models whether a consumer interested in electronic products will buy a computer, and this knowledge can be used to predict the purchase intention of a given record, i.e. a given person.

This decision tree classifies sales records and indicates whether an electronics consumer will purchase a computer (buys_computer). Each internal node (rectangular box) represents a test on an attribute, and each leaf node (oval box) represents a class: buys_computer = yes or buys_computer = no.

In this example, the feature vector is:

(age, student, credit_rating, buys_computer)

and the format of a record to be classified is:

(age, student, credit_rating)

Feeding a new, unlabeled record into the tree predicts which class that record belongs to.

Summary: a decision tree is a tree structure built from decision nodes. Each internal node represents a test on an attribute, each branch represents the outcome of that test, and each leaf node represents a classification result.
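To make this concrete, here is a minimal hand-coded sketch of such a tree. The attribute values used ("youth", "middle_aged", "senior", "fair", "excellent") are assumed placeholders and may differ from those in the original figure.

```python
# A minimal hand-coded sketch of a buys_computer decision tree.
# The attribute values used here are assumed placeholders, not taken from the figure.

def buys_computer(age: str, student: str, credit_rating: str) -> str:
    if age == "youth":                        # internal node: test on age
        return "yes" if student == "yes" else "no"    # then test on student
    elif age == "middle_aged":
        return "yes"                          # pure leaf
    else:                                     # age == "senior": test on credit_rating
        return "yes" if credit_rating == "fair" else "no"

# Predicting the class of a new record (age, student, credit_rating):
print(buys_computer("youth", "yes", "fair"))  # -> "yes"
```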

2. CLS algorithm

The CLS (Concept Learning System) algorithm is an early decision tree learning algorithm and the foundation of many later decision tree algorithms. The basic idea of CLS is to start from an empty decision tree and select an attribute (the classification attribute) as the test attribute; this test corresponds to a decision node in the tree. According to the values of this attribute, the training samples are divided into corresponding subsets. If a subset is empty, or all samples in the subset belong to the same class, that subset becomes a leaf node; otherwise it corresponds to an internal node (a test node) of the tree, and a new classification attribute must be selected to divide it further. This continues until every subset is empty or contains samples of a single class.

The problem with the CLS algorithm: using different test attributes, or testing them in a different order, produces different decision trees.
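A rough Python sketch of this recursive partitioning scheme follows; the attribute-selection rule is passed in as a function precisely because CLS leaves it unspecified (names such as choose_attribute are illustrative).

```python
# Schematic sketch of CLS-style recursive tree growing.
# samples: list of (feature_dict, label); attributes: list of attribute names.
from collections import Counter

def build_tree(samples, attributes, choose_attribute, default_label=None):
    if not samples:                                   # empty subset -> leaf with fallback label
        return default_label
    labels = [label for _, label in samples]
    if len(set(labels)) == 1:                         # all samples in one class -> leaf
        return labels[0]
    if not attributes:                                # nothing left to test -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    attr = choose_attribute(samples, attributes)      # the unspecified step: pick a test attribute
    majority = Counter(labels).most_common(1)[0][0]
    node = {"attribute": attr, "branches": {}}
    for value in {features[attr] for features, _ in samples}:
        subset = [(f, l) for f, l in samples if f[attr] == value]
        rest = [a for a in attributes if a != attr]
        node["branches"][value] = build_tree(subset, rest, choose_attribute, majority)
    return node
```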

 

3. ID3 algorithm 

Steps for building an ID3 decision tree:

Determine the set of candidate classification attributes.

For the current data table, create a node N.

If all records in the current data table belong to the same class, N is a leaf, and that class (a pure class) is marked on the leaf.

If there are no remaining attributes to consider, N is also a leaf, and the leaf is marked with the majority class (an impure class).

Otherwise, select the best attribute as the test attribute of node N, according to the expected information value E or the information gain (Gain).

Once the test attribute of the node has been selected, then for each value of that attribute: generate a branch from N, collect the records of the data table that take that value into the data table of the branch node, and delete the test attribute's column from that table.

If a branch's data table is not empty, return to the first step and apply the same procedure to build a subtree from that node.
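The quantities E and Gain referred to above are the entropy (expected information) and the information gain. A small sketch of how they can be computed (function names are illustrative):

```python
# Entropy Info(D) and information gain Gain(A), the ID3 attribute-selection criterion.
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i), over the classes present in D."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Gain(A) = Info(D) - sum_j |Dj|/|D| * Info(Dj), where Dj groups rows by A's value."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    expected = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return entropy(labels) - expected
```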

Comparison of the heuristic functions of common decision tree algorithms: ID3 uses the information gain, C4.5 uses the information gain ratio, and CART uses the Gini index.

Disadvantages of the ID3 algorithm:

(1) ID3 uses information gain as the criterion for selecting the split attribute at the root node and at each internal node. The drawback of information gain is that it tends to favor attributes with many values, even though in some cases such attributes provide little useful information.

(2) ID3 can only build decision trees for data sets whose descriptive attributes are all discrete.

4. C4.5 algorithm

C4.5's improvements over the ID3 algorithm:

Improvement 1: Use the information gain ratio instead of the information gain to select attributes

Improvement 2: Ability to discretize continuous-valued attributes

Improvement 3: Can handle missing attribute values

Improvement 4: Prune after the decision tree is constructed

Suppose the samples in D are to be partitioned on attribute A, and that, according to the training data, A has v distinct values {a1, a2, ..., aj, ..., av}. If A is discrete, D can be divided on A into v subsets {D1, D2, ..., Dj, ..., Dv}, where Dj is the subset of samples in D whose value of A is aj. These partitions correspond to the branches grown from the node that tests A.

The information gain measure is biased towards attributes with many values, i.e. it tends to select the attribute A with the larger v. As an extreme example, consider an attribute PID that acts as a unique identifier. Splitting on PID produces as many partitions as there are samples, each containing a single sample, so every partition is pure. Splitting on PID therefore yields the maximum information gain, yet such a split is obviously useless for classification.

C4.5 uses the split information (SplitInfo) to normalize the information gain and selects the attribute with the largest gain ratio as the split attribute.
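In the usual notation, the split information and the gain ratio are:

\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\log_2\frac{|D_j|}{|D|},
\qquad
\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}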

For example, with Info(D) = 0.940 and Info_income(D) = 0.911, the gain is Gain(income) = 0.940 - 0.911 = 0.029.

Of the 14 samples, 4 have high income, 6 have medium income, and 4 have low income, so:

SplitInfo_income(D) = -4/14 × log2(4/14) - 6/14 × log2(6/14) - 4/14 × log2(4/14) = 1.557

GainRatio(income) = Gain(income) / SplitInfo_income(D) = 0.029 / 1.557 = 0.019
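These numbers can be checked with a few lines of Python (the value 0.029 for Gain(income) is taken from the example above):

```python
# Verifying SplitInfo_income(D) and GainRatio(income) for 4 high / 6 medium / 4 low out of 14.
import math

counts = [4, 6, 4]
total = sum(counts)
split_info = -sum(c / total * math.log2(c / total) for c in counts)
print(round(split_info, 3))            # 1.557
print(round(0.029 / split_info, 3))    # 0.019  (Gain(income) = 0.029 from the example)
```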

For a continuous-valued attribute, sort the attribute values in increasing order and take the midpoint of each pair of adjacent values as a candidate split point (split_point). If the attribute has N distinct values, there are N - 1 candidate split points. Evaluate every candidate and take the split point that maximizes the information gain, splitting D into D1: A <= split_point and D2: A > split_point (one split point, a two-way split, hence a binary tree).

In practice, C4.5 does not use the midpoint; it uses the smaller value of each adjacent pair directly as the candidate split point (in this example, 5 and 6 would be used as candidate split points).
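A sketch of the split-point search for a continuous attribute, scoring candidates by information gain and, in the C4.5 style, using the smaller value of each adjacent pair as the candidate threshold (the helper entropy is the same as sketched earlier):

```python
# Split-point selection for a continuous attribute A: evaluate each candidate
# threshold and keep the one with the largest information gain.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def best_split_point(values, labels):
    base = entropy(labels)
    candidates = sorted(set(values))[:-1]          # smaller value of each adjacent pair
    best_point, best_gain = None, -1.0
    for point in candidates:
        left = [lab for v, lab in zip(values, labels) if v <= point]
        right = [lab for v, lab in zip(values, labels) if v > point]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if base - info > best_gain:
            best_point, best_gain = point, base - info
    return best_point, best_gain
```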

In some cases, the available data may have missing values for some attributes (for example, the weather attribute of the sixth instance below). A simple approach is to fill in the most common value of that attribute, e.g. "sunny" or "rain" for the missing weather value of the sixth instance. A more sophisticated strategy is to assign a probability to each possible value of A.

Gain(A) = F × (Info(D) - Info_A(D)), where F is the proportion of instances whose value of A is known (not missing); Info(D) and Info_A(D) are calculated ignoring the instances with missing values.

Info(D) = -8/13 × log2(8/13) - 5/13 × log2(5/13) = 0.961 bits

Info_weather(D) = 5/13 × (-2/5 × log2(2/5) - 3/5 × log2(3/5)) + 3/13 × (-3/3 × log2(3/3) - 0/3 × log2(0/3)) + 5/13 × (-3/5 × log2(3/5) - 2/5 × log2(2/5)) = 0.747 bits (taking 0 × log 0 = 0)

Gain(weather) = 13/14 × (0.961 - 0.747) = 0.199 bits

When calculating SplitInfo, the missing value is treated as an ordinary attribute value. In this example the weather attribute then has four values: sunny, cloudy, rain, and "?" (missing), and SplitInfo is computed over these four:

SplitInfo_weather(D) = -5/14 × log2(5/14) - 3/14 × log2(3/14) - 5/14 × log2(5/14) - 1/14 × log2(1/14) = 1.809 bits

GainRatio(weather) = Gain(weather) / SplitInfo_weather(D) = 0.199 / 1.809 ≈ 0.110

When splitting, an instance with a missing attribute value is assigned to all branches, but with a weight.

In this example, 13 of the 14 instances have a non-missing weather value: 5 are "sunny", 3 are "cloudy", and 5 are "rain". One instance (the sixth) has a missing weather value, so its value is estimated as: sunny with probability 5/13, cloudy with probability 3/13, and rain with probability 5/13.

The branch T1 can then be divided as:

humidity <= 75: 2 play, 0 don't play
humidity > 75: 5/13 play, 3 don't play

A leaf node is written in the form (N/E), where N is the number of instances reaching the leaf and E is the number of those instances that belong to other classes. For example, Don't Play (3.4/0.4) means that 3.4 instances reached the "Don't Play" leaf, of which 0.4 instances do not actually belong to "Don't Play".

For an instance whose humidity value is unknown, the probability that humidity <= 75 is 2.0/(2.0 + 3.4), and the probability that humidity > 75 is 3.4/(2.0 + 3.4).

When humidity <= 75, the probability of classifying as "play" is 100% and of "don't play" is 0. When humidity > 75, the probability of "play" is 0.4/3.4 ≈ 12% and of "don't play" is 3/3.4 ≈ 88%.

The final class distribution is therefore: play = 2.0/5.4 × 100% + 3.4/5.4 × 12% ≈ 44%, don't play = 3.4/5.4 × 88% ≈ 56%.
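The weighted combination above can be written out directly; the counts 2.0 and 3.4 are the (fractional) numbers of training instances at the two humidity leaves, as given in the example:

```python
# Classifying an instance with an unknown humidity value: it is sent down both
# humidity branches with weights proportional to the training counts 2.0 and 3.4.
w_low, w_high = 2.0, 3.4                            # humidity <= 75 / humidity > 75
total = w_low + w_high

p_play_low, p_noplay_low = 1.0, 0.0                 # humidity <= 75 leaf: all "play"
p_play_high, p_noplay_high = 0.4 / 3.4, 3.0 / 3.4   # humidity > 75 leaf: ~12% / ~88%

p_play = (w_low / total) * p_play_low + (w_high / total) * p_play_high
p_noplay = (w_low / total) * p_noplay_low + (w_high / total) * p_noplay_high
print(round(p_play, 2), round(p_noplay, 2))         # 0.44 0.56
```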

The decision tree algorithms above keep deepening each branch of the tree until the training examples are classified perfectly. In practice, when the training samples contain noise, or when there are too few samples to be representative of the target function, this strategy runs into difficulty: the tree produced by such a simple algorithm overfits the training samples (overfitting). Causes of overfitting include noise in the training samples and too few training samples.

Advantages and Disadvantages of C4.5:

Advantages: The generated classification rules are easy to understand and have high accuracy.

Disadvantages: during tree construction, the data set must be scanned and sorted repeatedly, which makes the algorithm inefficient. In addition, C4.5 is only suitable for data sets that can reside in memory; when the training set is too large to fit in memory, the program cannot run.

5. CART algorithm

The Classification and Regression Tree (CART) algorithm is characterized by its use of a binary tree structure throughout: the root node contains all samples and is split into two child nodes under a given splitting rule, and this process is repeated on the child nodes until no further split is possible and the nodes become leaves. CART uses the Gini index to select the splitting attribute, always performs binary splits (so it produces a binary tree), and applies cost-complexity pruning.

Algorithm description (T denotes the current sample set and T_attributelist the current candidate attribute set):

(1) Create root node N.

(2) Assign a class to N.

(3) If all samples in T belong to the same class, or only one sample remains in T, return N as a leaf node; otherwise assign it a test attribute.

(4) For each attribute in T_attributelist, perform a split on that attribute and compute the Gini index of the split.

(5) Set N's test attribute test_attribute to the attribute in T_attributelist with the smallest Gini index.

(6) Split T to obtain the subsets T1 and T2.

(7) Repeat (1)-(6) on T1.

(8) Repeat (1)-(6) on T2.

The CART algorithm treats every node as a potential leaf, so it assigns a class to every node. The class can be the one that occurs most often among the samples at the node, or it can be chosen with reference to the node's classification error or by other, more elaborate methods.

The smaller the Gini index, the purer the partition. The attribute with the smallest Gini index (equivalently, the largest ΔGini) is selected as the splitting attribute.
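In standard notation, with p_i the fraction of class i at a node and a binary split of D into D1 and D2:

\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^{2},
\qquad
\mathrm{Gini}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2),
\qquad
\Delta\mathrm{Gini}(A) = \mathrm{Gini}(D) - \mathrm{Gini}_A(D)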

Handling discrete-valued attributes: taking income as an example, the possible subsets of the income attribute's values are {low, medium, high}, {low, medium}, {low, high}, {medium, high}, {low}, {medium} and {high}. Considering all possible binary partitions (the full set and the empty set do not define a useful split), CART computes the Gini index before and after each partition and chooses the subset that yields the smallest Gini index as the splitting subset.
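A sketch of this subset search for a discrete attribute such as income (the sample data below is made up for illustration):

```python
# Binary Gini split on a discrete attribute: try the non-trivial value subsets
# and keep the split {subset, complement} with the smallest weighted Gini index.
from collections import Counter
from itertools import combinations

def gini(labels):
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_gini_split(values, labels):
    domain = sorted(set(values))
    best_score, best_subset = float("inf"), None
    for size in range(1, len(domain)):               # proper, non-empty subsets only
        for subset in combinations(domain, size):
            left = [lab for v, lab in zip(values, labels) if v in subset]
            right = [lab for v, lab in zip(values, labels) if v not in subset]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best_score, best_subset = score, set(subset)
    return best_subset, best_score

# Illustrative (made-up) data for the income attribute:
income = ["low", "medium", "high", "medium", "low", "high", "medium"]
label  = ["no",  "yes",    "yes",  "yes",    "no",  "yes",  "no"]
print(best_gini_split(income, label))
```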

6. Recursive partitioning (greedy algorithm)

Starting from the root node, consider a splitting variable j and a split point s, which divide the data into two regions. The optimal variable j and split point s are the ones that minimize the least-squares criterion written out below. For given j and s, the solution of the inner optimization problem is simply the average response within each region. For a given j, the split point s can therefore be found quickly, and by scanning over all input variables we find the best pair (j, s).
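In the standard least-squares formulation used for CART regression trees, the two regions and the splitting criterion are:

R_1(j, s) = \{\, x \mid x_j \le s \,\}, \qquad R_2(j, s) = \{\, x \mid x_j > s \,\}

\min_{j,\; s}\;\Bigl[\; \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 \;+\; \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Bigr]

and for fixed (j, s) the inner minimizations are solved by the region means:

\hat{c}_1 = \operatorname{ave}(y_i \mid x_i \in R_1(j,s)), \qquad \hat{c}_2 = \operatorname{ave}(y_i \mid x_i \in R_2(j,s))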

 

 

 

 

 


Origin blog.csdn.net/weixin_43961909/article/details/132537620