R language study notes - decision tree classification

1. Introduction

The decision tree classification algorithm classifies samples according to their feature attributes by means of a tree structure. Typical algorithms include ID3, C4.5, C5.0 and CART. Every decision tree consists of a root node, internal nodes and leaf nodes.

Root node: represents the first feature attribute to split on; it has only outgoing edges and no incoming edge, and is usually drawn as a rectangle.

Internal node: represents a feature attribute; it has one incoming edge and at least two outgoing edges, and is usually drawn as a circle.

Leaf node: represents a class; it has one incoming edge and no outgoing edges, and is usually drawn as a triangle.

The decision tree algorithm is mainly used for classification, but it can also be used for regression. When the output variable is categorical, the tree is a classification tree; when the output variable is continuous, it is a regression tree. Although the dependent variable of a regression tree is continuous, the number of leaf nodes is finite, so the predicted value is the average of the observations falling into a leaf node.
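As a quick illustration of the distinction (not part of the original notes; it assumes the rpart package, which ships with R), the same interface builds a classification tree for a factor response and a regression tree for a numeric response:

#Illustration only: classification tree vs regression tree with rpart on the built-in iris data
library(rpart)
class_tree <- rpart(Species ~ ., data = iris, method = "class")       #categorical output
reg_tree   <- rpart(Sepal.Length ~ ., data = iris, method = "anova")  #continuous output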

2. Basic idea

The basic idea of the decision tree algorithm can be summarized as follows:

The first step is to divide the feature space according to the feature attributes;

The second step is to apply the first step recursively to each resulting subset.

The classification process looks simple, but to obtain "completely pure" subsets, that is, subsets in which all samples belong to the same class, an index is needed to evaluate how good a split is: entropy.

 

Entropy measures how disordered a system is and is widely used in information theory. The larger the entropy, the lower the purity of the data; when the entropy equals 0, all samples belong to the same class. It is calculated as

H(X) = -∑_i P(X_i) · log_b P(X_i)

where P(X_i) is the probability of class X_i; b is taken as 2 here, so the unit of entropy is the bit.
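As a small illustration (this helper is not part of the original notes; entropy_from_counts is an invented name), the formula translates directly into R:

#Minimal sketch: Shannon entropy from a vector of class counts, base-2 logarithm
entropy_from_counts <- function(counts, b = 2) {
  p <- counts / sum(counts)        #probabilities P(X_i)
  p <- p[p > 0]                    #treat 0 * log(0) as 0 by dropping zero probabilities
  -sum(p * log(p, base = b))
}
entropy_from_counts(c(9, 5))       #about 0.940 bit for 9 "yes" versus 5 "no"
entropy_from_counts(c(7, 0))       #a pure set has entropy 0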

 

Information gain: it measures how much a split improves data purity; the more the entropy decreases after the split, the larger the gain and the more valuable the split. The gain is calculated as

Gain(D, a) = H(D) - ∑_{v=1}^{V} (|D_v| / |D|) · H(D_v)

where D is the sample set and the attribute a has V possible values (discrete or continuous), which partition D into subsets D_1, ..., D_V. To choose the splitting attribute, compute this gain for every candidate attribute and select the one with the largest information gain. In short, the information gain is the difference between the entropy before the split and the weighted entropy after the split.

Note that when computing the total entropy of the subsets, each subset's entropy must be weighted by the proportion of the parent set that the subset contains. For example, if the entropy before the split is e, and the split produces subsets a and b of sizes m and n with entropies e1 and e2 respectively, then the information gain is e - e1*m/(m+n) - e2*n/(m+n).
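As a small numeric sketch (not from the original notes; the example values happen to match the humidity split computed later), the weighted-gain arithmetic can be checked directly in R:

#Illustration of the weighted-gain arithmetic described above (rounded example values)
e  <- 0.940              #entropy of the parent set before the split
e1 <- 0.985; m <- 7      #entropy and size of subset a
e2 <- 0.592; n <- 7      #entropy and size of subset b
e - e1 * m / (m + n) - e2 * n / (m + n)   #information gain, about 0.15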

3. Implementation of ID3 algorithm

The essence of classification is the division of the feature space. Following the basic idea of the decision tree, the algorithm can be implemented in three steps:

1. Select a feature attribute and split the sample on it.

2. Calculate the information gain of each candidate attribute and choose the one with the largest gain as the splitting node.

3. Recursively repeat the previous two steps on each subset until the classification is complete (a compact sketch of this recursion is given below).
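Before the step-by-step walk-through, here is a minimal recursive sketch of these three steps. It is an illustration only, not the code from these notes; names such as id3 and entropy0 are invented, and the tree is returned as a nested list.

#Recursive ID3 sketch: split on the attribute with the largest information gain,
#then recurse on every subset; returns a nested list describing the tree.
entropy0 <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}
id3 <- function(data, target) {
  y <- data[[target]]
  attrs <- setdiff(names(data), target)
  #Leaf node: the subset is pure or there is nothing left to split on
  if (length(unique(y)) == 1 || length(attrs) == 0) {
    return(names(which.max(table(y))))
  }
  #Steps 1 and 2: information gain of every remaining attribute
  gains <- sapply(attrs, function(a) {
    sub_e <- tapply(y, data[[a]], entropy0)     #entropy of each subset
    w <- table(data[[a]]) / nrow(data)          #weight of each subset
    entropy0(y) - sum(w * sub_e, na.rm = TRUE)
  })
  best <- attrs[which.max(gains)]
  #Step 3: recurse on each subset, dropping the attribute just used
  branches <- lapply(split(data, data[[best]], drop = TRUE), function(d) {
    id3(d[, names(d) != best, drop = FALSE], target)
  })
  list(split_on = best, branches = branches)
}

Applied to the weather data introduced below, id3(weather, "play") should choose outlook at the root, which matches the step-by-step derivation that follows.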

The following walks through an implementation of the ID3 algorithm in R, step by step.

We have 14 days of weather data (with the attributes outlook, temperature, humidity and windy) and, for each day, whether play took place. Given a new day with the values sunny, cool, high, TRUE, we want to predict whether to play.

outlook     temperature   humidity   windy   play
sunny       hot           high       FALSE   no
sunny       hot           high       TRUE    no
overcast    hot           high       FALSE   yes
rainy       mild          high       FALSE   yes
rainy       cool          normal     FALSE   yes
rainy       cool          normal     TRUE    no
overcast    cool          normal     TRUE    yes
sunny       mild          high       FALSE   no
sunny       cool          normal     FALSE   yes
rainy       mild          normal     FALSE   yes
sunny       mild          normal     TRUE    yes
overcast    mild          high       TRUE    yes
overcast    hot           normal     FALSE   yes
rainy       mild          high       TRUE    no

Without any weather information, the historical data only tell us that the probability of playing on a new day is 9/14 and the probability of not playing is 5/14. The entropy at this point is H = -(9/14)·log2(9/14) - (5/14)·log2(5/14) ≈ 0.940.
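This value is easy to verify in the R console:

#Entropy of the full sample: 9 "yes" and 5 "no" observations
-(9/14) * log2(9/14) - (5/14) * log2(5/14)   #about 0.940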

There are four attributes: outlook, temperature, humidity and windy. We first have to decide which of them should be the root node of the tree.

Count how many times play was yes and no for each value of every attribute:

outlook    yes  no
sunny       2    3
overcast    4    0
rainy       3    2

temperature  yes  no
hot           2    2
mild          4    2
cool          3    1

humidity  yes  no
high       3    4
normal     6    1

windy   yes  no
FALSE    6    2
TRUE     3    3

play   yes  no
total   9    5

Code:

##ID3 algorithm of decision tree model

#Read the table data from the clipboard (stringsAsFactors so the character columns become factors)
weather <- read.table("clipboard", header = TRUE, stringsAsFactors = TRUE)
#View the data structure
str(weather)
#Convert the windy indicator (logical) to a factor
weather$windy <- as.factor(weather$windy)
#Information entropy before any split
q <- table(weather$play) / sum(table(weather$play))
e <- -sum(q * log2(q))
#Weighted entropy after splitting on attribute x, with class labels y
myfun <- function(x, y) {
  m <- table(x, y)               #contingency table: rows = attribute values, columns = classes
  #weight of each subset = subset size / parent size
  n <- rowSums(m) / sum(m)
  #entropy of each subset after the split
  freq <- -rowSums((m / rowSums(m)) * log2(m / rowSums(m)))
  #final weighted entropy; a row with a zero count gives NaN from 0*log2(0),
  #and dropping it via na.rm is correct here because such a pure subset has entropy 0
  entropy <- sum(n * freq, na.rm = TRUE)
  return(entropy)
}
#Information gain of each feature attribute
y <- weather[, 5]
gain <- vector()
for (i in 1:(length(weather) - 1)) {
  x <- weather[, i]
  gain[i] <- e - myfun(x, y)
}
names(gain) <- colnames(weather)[1:4]
gain

Operation result: the information gains come out to roughly outlook 0.247, temperature 0.029, humidity 0.152 and windy 0.048.

Comparing the information gains of the feature attributes, outlook has the largest gain, i.e. splitting on it reduces the entropy the most, so outlook should be chosen as the root node of the decision tree.

Next, select the feature attribute for each of the child nodes N1, N2 and N3 (the sunny, overcast and rainy branches of the root).

After splitting the sample on outlook, the three values sunny, overcast and rainy divide the sample into three parts; within each part we calculate gain(temperature), gain(humidity) and gain(windy).

Code:

#Select the feature attribute for each child node
level <- levels(weather$outlook)
son_e <- vector()                  #entropy of each outlook subset
son_gain <- data.frame()
for (j in 1:length(level)) {
  son_q <- table(weather[weather$outlook == level[j], ]$play)
  son_q <- son_q / sum(son_q)
  #na.rm treats 0*log2(0) as 0, so a pure subset (overcast) gets entropy 0
  son_e[j] <- -sum(son_q * log2(son_q), na.rm = TRUE)
}
for (j in 1:length(level)) {
  sl <- weather[weather$outlook == level[j], ]
  for (i in 2:4) {                 #remaining attributes: temperature, humidity, windy
    son_x <- sl[, i]
    son_y <- sl[, 5]
    son_gain[j, i - 1] <- son_e[j] - myfun(x = son_x, y = son_y)
  }
}
colnames(son_gain) <- colnames(weather)[2:4]
rownames(son_gain) <- level
son_gain

Information gain results: on the sunny branch the gains are roughly temperature 0.571, humidity 0.971 and windy 0.020; on the rainy branch roughly temperature 0.020, humidity 0.020 and windy 0.971; the overcast branch is already pure, so its entropy and all of its gains are 0.

According to these results, on the sunny branch humidity has the largest information gain, i.e. it reduces the entropy the fastest, so humidity should be the splitting attribute at child node N1.

After this split, the humidity = high subset contains only "no" samples and the humidity = normal subset contains only "yes" samples, so there is no need to split further below the humidity node; its branches end in leaf nodes.

Similarly, child node N3 on the rainy branch should split on windy, and no further splitting is required below it (windy = FALSE gives "yes", windy = TRUE gives "no").

On the overcast branch all samples belong to the same class ("yes"), so node N2 is itself a leaf node.

The final classification tree is:

outlook = sunny:    humidity = high   -> no
                    humidity = normal -> yes
outlook = overcast: yes
outlook = rainy:    windy = TRUE  -> no
                    windy = FALSE -> yes

For the new day given at the beginning (sunny, cool, high, TRUE), the tree follows the sunny branch and then humidity = high, so the prediction is not to play.

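As a final check (an illustration only; predict_play is an invented helper, not part of the original notes), the learned rules can be written out by hand and applied to the new day:

#Hand-coded version of the learned tree; outlook and humidity are character, windy is logical
predict_play <- function(outlook, humidity, windy) {
  outlook <- tolower(outlook)
  if (outlook == "overcast") return("yes")
  if (outlook == "sunny") return(if (humidity == "high") "no" else "yes")
  if (outlook == "rainy") return(if (windy) "no" else "yes")
  NA
}
predict_play("sunny", "high", TRUE)   #"no": the new day (sunny, cool, high, TRUE) is predicted as no play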