Knowledge Points | Machine Learning: Decision Trees

Introduction

A decision tree is a type of predictive model that represents a mapping between object attributes and object values. Each internal node of the tree corresponds to an attribute being tested, each branch corresponds to a possible value of that attribute, and each leaf node corresponds to the value of the object represented by the path from the root to that leaf.

On classification problems

Here we mainly consider decision tree algorithms for classification problems. A simple way to distinguish a classification problem from a regression problem: the target attribute of classification is discrete, while the target attribute of regression is continuous.

Steps for solving a classification problem

1. Model construction: use an induction algorithm to generalize from the training set, generate readable rules, and build a classification model.

2. Prediction and inference: using the rules and the established classification model, classify the test set and process new data.
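As a concrete illustration of these two steps, here is a minimal sketch using scikit-learn and its bundled iris dataset (both assumed to be available); the dataset, split ratio, and parameters are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a toy dataset and split it into a training set and a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: model construction -- induce a classification model from the training set
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Step 2: prediction and inference -- apply the model to the test set and to new data
print("test accuracy:", clf.score(X_test, y_test))
print("prediction for one new sample:", clf.predict(X_test[:1]))
```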

About induction algorithms

  • Induction is the process from the specific to the general, and the process of induction is the process of searching in the description space.
  • Induction can be divided into three methods: top-down, bottom-up, and two-way search.
    • Bottom-up method: process one input object at a time and gradually generalize the description until the final generalized description.
    • Top-down method: Search the general description set to find the optimal description that meets certain requirements.
  • Inductive algorithms are at the heart of decision tree technology for discovering patterns and rules in data.
  • Inductive learning relies on test data, so it is also called test learning.
  • Inductive reasoning attempts to obtain a complete and correct description from specific observations of a part or the whole of an object. That is, from partial facts to conclusions of universal laws.

Basic assumptions of induction

There is a basic assumption in inductive learning: any hypothesis that approximates the target function well over a sufficiently large training sample set will also approximate it well on unseen test samples.

This assumption is a prerequisite for the effectiveness of inductive learning.

Decision tree model

A classification decision tree is a tree structure that describes the classification of instances. It is composed of nodes and directed edges. Nodes are of two kinds: internal nodes, each representing a feature or attribute, and leaf nodes, each representing a class. By convention, internal nodes are drawn as circles and leaf nodes as boxes.

Classification with a decision tree starts from the root node: a certain feature of the instance is tested, and the instance is assigned to the child node corresponding to the test result (each child node corresponds to one value of that feature). This continues recursively until a leaf node is reached, and the instance is assigned to the class of that leaf.

The core issue in constructing a decision tree is how to select appropriate attributes to split the sample at each step. For a classification problem, learning and constructing a decision tree from training samples with known class labels is a top-down, divide-and-conquer process.

if-then rules for decision trees

A rule is constructed from each path from the root node to the leaf node of the decision tree: the characteristics of the internal nodes on the path correspond to the conditions of the rule, and the class of the leaf node corresponds to the conclusion of the rule.

The path of the decision tree and its corresponding set of if-then rules have an important property: they are mutually exclusive and complete. This means that every instance is covered by one path or rule, and only one path or rule.
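If scikit-learn is available, this rule view of a tree can be inspected directly. The following sketch (dataset and tree depth chosen only for illustration) prints one if-then rule per root-to-leaf path.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# Each printed root-to-leaf path is one rule: the internal-node tests form the
# conditions, and the class shown at the leaf is the conclusion.
print(export_text(clf, feature_names=list(data.feature_names)))
```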

Decision trees and conditional probability distributions

Decision trees represent the conditional probability distribution of a class given the characteristics. The conditional probability distribution is defined on a division of the feature space. Dividing the feature space into disjoint units or regions, and defining a class probability distribution in each unit constitutes a conditional probability distribution.

A path in the decision tree corresponds to a unit in the partition. The conditional probability distribution represented by the decision tree consists of the conditional probability distribution of the class under given conditions of each unit.

Decision tree learning essentially summarizes a set of classification rules from the training data set. There may be multiple decision trees that can correctly classify the training data, or there may be none. What we need is a decision tree that is less inconsistent with the training data and has better generalization ability. That is, the conditional probability model should not only fit the training data well, but also predict the unknown data well.

Decision tree algorithm

Four important algorithms: CLS, ID3, C4.5, and CART. ID3 selects features by information gain: the larger the gain, the higher the feature's priority. C4.5 uses the information gain ratio instead, to reduce the bias toward features with many values that inflates information gain. The CART classification tree uses the Gini index to select features; the Gini index measures the impurity of the model, so the smaller the Gini index, the lower the impurity and the better the feature. This is the opposite direction from information gain (ratio), where larger is better.

Algorithm process:

  • 1966: the CLS learning system
  • 1979: ID3, obtained by simplifying CLS
  • 1984: the CART algorithm
  • 1986: ID4, obtained by adding a node buffer to ID3
  • 1988: ID5, obtained by optimizing the efficiency of ID4
  • 1993: C4.5, obtained by improving ID3

Decision tree CLS algorithm

The CLS (Concept Learning System) algorithm is the basis of many decision tree learning algorithms.

Basic idea

The basic idea of CLS is to start from an empty decision tree and select a classification attribute as the test attribute; this test attribute corresponds to a decision node in the tree. According to the values of this attribute, the training samples are divided into corresponding subsets.

If the subset is empty or the samples in the subset belong to the same class, the subset is a leaf node. Otherwise, the subset corresponds to the internal node of the decision tree, that is, the test node, and a new classification attribute needs to be selected to divide the subset until all subsets are empty or belong to the same category.

Algorithm steps

1. Generate an empty decision tree and a training sample attribute set;

2. If all samples in the training sample set T belong to the same category, generate node T and terminate the learning algorithm; otherwise, continue;

3. Select attribute A from the training sample attribute table as the test attribute according to a certain strategy, and generate test node A;

(Note: the composition of the test attribute set and the order in which attributes are tested have a decisive impact on decision tree learning; see the discussion below.)

4. If the values of A are v1, v2, ..., vm, then T is divided into m subsets T1, T2, ..., Tm according to the different values of A;

5. Delete attribute A from the training sample attribute table, and call CLS recursively on each subset (returning to step 2).
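The following is a minimal Python sketch of the steps above, assuming samples are represented as dicts mapping attribute names to values and a tree node as a nested dict; the "certain strategy" of step 3 is replaced here by simply taking the first remaining attribute.

```python
from collections import Counter

def cls_build(samples, labels, attributes):
    """Sketch of CLS: samples is a list of dicts (attribute name -> value)."""
    # Leaf: the subset is empty or all its samples belong to the same class
    if not samples or len(set(labels)) == 1:
        return labels[0] if labels else None
    if not attributes:                       # no attributes left: fall back to majority class
        return Counter(labels).most_common(1)[0][0]
    a = attributes[0]                        # "certain strategy": take the first attribute
    node = {"attribute": a, "children": {}}
    for v in {s[a] for s in samples}:        # split T into T1..Tm by the values of A
        idx = [i for i, s in enumerate(samples) if s[a] == v]
        node["children"][v] = cls_build(     # delete A and recurse on each subset
            [samples[i] for i in idx],
            [labels[i] for i in idx],
            [x for x in attributes if x != a])
    return node
```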

Algorithmic thinking

In practice, the composition of the test attribute set and the order in which attributes are tested have a decisive impact on decision tree learning: different features and different selection orders produce different decision trees, so feature selection is particularly important.

So how should features be selected? The ID3 algorithm below addresses exactly this question.

Decision tree ID3 algorithm

The ID3 algorithm mainly targets the attribute selection problem. It is the most influential and typical algorithm among decision tree learning methods.

Basic idea

Building on the basic idea of CLS, the ID3 algorithm selects features by information gain.

Obtaining information means turning uncertainty into certainty, so information is inseparable from uncertainty. Roughly speaking, a low-probability event carries more information than a high-probability one: something that happens "once in a hundred years" conveys more information than an everyday occurrence. How, then, do we measure the amount of information? This is where concepts from information theory come in.

The concept of entropy

Entropy: a measure of the amount of information, and also a measure of the uncertainty of a random variable.

Intuitive explanation of entropy: the amount of information (self-information) of an event A_i can be expressed as l(A_i)=log_2\frac{1}{p(A_i)}, where p(A_i) is the probability that event A_i occurs; entropy is then the expected amount of information over all events.

Theoretical explanation of entropy: Suppose X is a discrete random variable taking a finite number of values, and its probability distribution is

P(X=x_i)=p_i, i=1,2,...,n

Then the entropy of random variable X is:

H(X)=-\sum_{i=1}^{n}p_ilogp_i

0\leq H(X)\leq logn

The logarithm has base 2 or base e, and the unit of entropy is called bit or nat respectively.

Entropy only depends on the distribution of X and has nothing to do with the value of X.

The greater the entropy, the greater the uncertainty of the corresponding random variable.

When X follows the 0-1 (Bernoulli) distribution, P(X=1)=p, P(X=0)=1-p, 0\leq p\leq 1, the entropy is

H(p)=-plog_2p-(1-p)log_2(1-p)

The change of H(p) with p can be plotted as a curve: H(p) is 0 at p=0 and p=1, and reaches its maximum value of 1 bit at p=0.5.
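A short sketch to make the shape of this curve concrete (pure Python, no plotting library needed):

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p); taken as 0 at p = 0 or p = 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p = {p:.1f}   H = {binary_entropy(p):.3f}")
# The values rise from 0 to a maximum of 1 bit at p = 0.5, then fall back to 0.
```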

Conditional entropy: H(Y|X)=\sum_{i=1}^{n}p_iH(Y|X=x_i), where p_i=P(X=x_i). It represents the uncertainty of random variable Y given random variable X, and is defined as the expectation over X of the entropy of the conditional distribution of Y given X=x_i.

When the probabilities in entropy and conditional entropy are obtained from data estimation (especially maximum likelihood estimation), the corresponding entropy and conditional entropy are called empirical entropy and empirical conditional entropy respectively.

Information gain

Information gain: the information gain g(D,A) of feature A on training data set D is defined as the difference between the empirical entropy H(D) of set D and the empirical conditional entropy H(D|A) of D given feature A, that is, g(D,A)=H(D)-H(D|A).

Information gain represents the degree to which knowing the information of feature X reduces the uncertainty about the class Y. In general, H(Y)-H(Y|X) is called the mutual information, so the information gain is equivalent to the mutual information between the class and the feature in the training data set.

Information gain algorithm

Input: training data set D and feature A;

1. Calculate the empirical entropy of data set D:

H(D)=-\sum_{k=1}^{K}\frac{|C_k|}{|D|}log_2\frac{|C_k|}{|D|}

2. Calculate the empirical conditional entropy H(D|A) of data set D given feature A:

H(D|A)=\sum_{i=1}^{n}\frac{|D_i|}{|D|}H(D_i)=-\sum_{i=1}^{n}\frac{|D_i|}{|D|}\sum_{k=1}^{K}\frac{|D_{ik}|}{|D_i|}log_2\frac{|D_{ik}|}{|D_i|}

3. Calculate information gain

g(D,A)=H(D)-H(D|A)

Output: the information gain g(D,A) of feature A on training data set D.

  • |C_k| is the number of samples belonging to class C_k
  • Feature A has n different values {a1, a2, ..., an}; according to the value of feature A, D is divided into n subsets D1, ..., Dn
  • D_ik is the set of samples in subset D_i that belong to class C_k
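The following minimal sketch implements these three steps for a single categorical feature column; the toy "windy"/"play" data are hypothetical and serve only to illustrate the calculation.

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Information gain g(D, A) = H(D) - H(D|A) for one categorical feature column."""
    n = len(labels)
    h_d_a = 0.0
    for v in set(feature_values):
        subset = [c for f, c in zip(feature_values, labels) if f == v]
        h_d_a += len(subset) / n * entropy(subset)      # (|Di| / |D|) * H(Di)
    return entropy(labels) - h_d_a

# Hypothetical toy data: does the feature "windy" help predict the class "play"?
windy = ["no", "no", "yes", "yes", "no", "yes"]
play  = ["yes", "yes", "no", "no", "yes", "yes"]
print(info_gain(windy, play))
```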

ID3 algorithm

1. Starting from the root node, calculate the information gain of all possible features, and select the feature with the largest information gain as the dividing feature of the node;

2. Create sub-nodes based on different values ​​of the feature;

3. Repeat steps 1-2 recursively on each child node to build the decision tree;

4. Stop when no features remain to be selected or when all samples in a node belong to the same class; the final decision tree is then obtained.
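A sketch of ID3 under the same representation as the CLS sketch above; it reuses the info_gain function from the previous snippet and differs from CLS only in how the test attribute is chosen.

```python
from collections import Counter

def id3_build(samples, labels, attributes):
    """Sketch of ID3; samples is a list of dicts (attribute name -> value)."""
    if len(set(labels)) == 1:                 # all samples belong to one class
        return labels[0]
    if not attributes:                        # no features left: majority class
        return Counter(labels).most_common(1)[0][0]
    # Select the feature with the largest information gain as the splitting feature
    best = max(attributes,
               key=lambda a: info_gain([s[a] for s in samples], labels))
    node = {"attribute": best, "children": {}}
    for v in {s[best] for s in samples}:      # one child node per value of the feature
        idx = [i for i, s in enumerate(samples) if s[best] == v]
        node["children"][v] = id3_build([samples[i] for i in idx],
                                        [labels[i] for i in idx],
                                        [a for a in attributes if a != best])
    return node
```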

Algorithmic thinking

The ID3 algorithm uses information entropy as the measure for selecting attributes at decision tree nodes. At each step it selects the most informative attribute, i.e., the attribute that reduces the entropy the most, so that entropy decreases as quickly as possible along the tree. The entropy at a leaf node is 0, at which point all instances in the set corresponding to that leaf belong to the same class.

Decision tree C4.5 algorithm

ID3 selects features by information gain, preferring features with larger gain. C4.5 uses the information gain ratio instead: dividing the training data set by information gain alone tends to favor features with many values, and this bias can be corrected by using the information gain ratio.

Information gain ratio

The information gain ratio of feature A with respect to training data set D is defined as the ratio of its information gain to the entropy of training data set D with respect to the values of feature A:

g_R(D,A)=\frac{g(D,A)}{H_A(D)}

where H_A(D)=-\sum_{i=1}^{n}\frac{|D_i|}{|D|}log_2\frac{|D_i|}{|D|} and n is the number of distinct values of feature A.
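Reusing the entropy and info_gain helpers sketched earlier, the gain ratio can be computed as follows; note that H_A(D) is simply the entropy of the feature-value column itself.

```python
def gain_ratio(feature_values, labels):
    """g_R(D, A) = g(D, A) / H_A(D), reusing entropy() and info_gain() from above."""
    h_a = entropy(feature_values)   # H_A(D): entropy of D with respect to the values of A
    if h_a == 0.0:                  # feature takes a single value -> no useful split
        return 0.0
    return info_gain(feature_values, labels) / h_a
```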

Decision tree CART algorithm

The decision trees generated by the ID3 and C4.5 algorithms are multiway trees and handle only classification, not regression. The CART (Classification And Regression Tree) algorithm can be used for both classification and regression: the output of a classification tree is the class of the sample, while the output of a regression tree is a real number.

Algorithm composition

  • Decision tree generation
  • Decision tree pruning

CART tree

  • Categorical target variable (classification tree): Gini index, Twoing, ordered Twoing
  • Continuous target variable (regression tree): least squared deviation, least absolute deviation

Gini index

The purity of data set D can be measured by the Gini value

Gini(D)=\sum_{i=1}^{n}p(x_i)(1-p(x_i))=1-\sum_{i=1}^{n}p(x_i)^{2}

Here p(x_i) is the probability of class x_i occurring and n is the number of classes. Gini(D) reflects the probability that two samples drawn at random from data set D have different class labels; therefore, the smaller Gini(D) is, the higher the purity of data set D.

For a sample set D of size |D|, feature A and one of its possible values a split D into two parts: D1 (the samples for which A = a) and D2 (the rest). This is why the CART classification tree algorithm builds a binary tree rather than a multiway tree.

Given feature A and one of its values a, the Gini index of sample set D is defined as

Gini(D,A=a)=\frac{|D_1|}{|D|}Gini(D_1)+\frac{|D_2|}{|D|}Gini(D_2)
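A minimal sketch of these two formulas on categorical data (the toy data are hypothetical):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2 over the class proportions of D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(feature_values, labels, a):
    """Gini(D, A=a): weighted Gini of the binary split D1 (A == a) vs. D2 (A != a)."""
    d1 = [c for f, c in zip(feature_values, labels) if f == a]
    d2 = [c for f, c in zip(feature_values, labels) if f != a]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

# Hypothetical toy data: evaluate the candidate split "windy == yes" vs. "windy != yes"
windy = ["no", "no", "yes", "yes", "no", "yes"]
play  = ["yes", "yes", "no", "no", "yes", "yes"]
print(gini_index(windy, play, "yes"))
```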

Advantages of decision trees

1. The decision-making reasoning process can be expressed in the form of If-Then;

2. The reasoning process depends entirely on the values of the attribute variables;

3. Attribute variables that do not contribute to the target variable can be ignored.


Source: blog.csdn.net/weixin_73404807/article/details/133769224