Decision Tree Essay

Principles of decision trees

Decision trees are versatile machine learning algorithms that can perform classification and regression tasks, and even multi-output tasks. They are powerful and can fit complex datasets.

  • Advantages: simple and intuitive; needs essentially no preprocessing (no normalization required, and missing values can be handled); reasonably accurate; insensitive to outliers; makes no assumptions about the input data. Decision trees can handle both discrete and continuous values, and can handle multi-dimensional output classification problems.
  • Disadvantages: high computational complexity; easy to overfit, giving poor generalization; the tree structure can change drastically after small modifications to the samples; high space complexity.
  • Applicable data range: numeric type and nominal type.
The pseudocode for the branch-creating function createBranch() is as follows:

    Check whether every item in the dataset belongs to the same class:
    If so, return the class label
    Else
        find the best feature for splitting the dataset
        split the dataset
        create a branch node
        for each split subset
            call createBranch() and add the result to the branch node
        return the branch node

The pseudocode createBranch above is a recursive function that directly calls itself on the penultimate line.
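As an illustration, here is a minimal runnable Python sketch of this recursion. It is not the article's own code: the information-gain-based feature choice and the tiny toy dataset are assumptions added for the example.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Shannon entropy of the class labels (last column) of a set of rows."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def choose_best_feature(rows):
    """Index of the feature whose split yields the largest information gain."""
    base = entropy(rows)
    best_gain, best_index = 0.0, 0
    for i in range(len(rows[0]) - 1):            # last column is the label
        values = {row[i] for row in rows}
        new_entropy = sum(
            (len(sub) / len(rows)) * entropy(sub)
            for v in values
            for sub in [[row for row in rows if row[i] == v]]
        )
        gain = base - new_entropy
        if gain > best_gain:
            best_gain, best_index = gain, i
    return best_index

def create_branch(rows, features):
    """Recursive tree construction following the pseudocode above."""
    labels = [row[-1] for row in rows]
    if labels.count(labels[0]) == len(labels):   # all samples share one class
        return labels[0]
    if not features:                             # no features left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = choose_best_feature(rows)
    tree = {features[best]: {}}
    remaining = features[:best] + features[best + 1:]
    for value in {row[best] for row in rows}:
        subset = [row[:best] + row[best + 1:] for row in rows if row[best] == value]
        tree[features[best]][value] = create_branch(subset, remaining)
    return tree

# Toy dataset (illustrative): does the animal surface? has flippers? -> is it a fish?
data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(create_branch(data, ['no surfacing', 'flippers']))
```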

Its working principle is simple. The flow chart shown in the figure is a decision tree: rectangles represent decision blocks, and ellipses represent terminating blocks, meaning a conclusion has been reached and execution can stop. The arrows leading out of a decision block are called branches; a branch can lead either to another decision block or to a terminating block.
[Figure: decision tree flow chart with decision blocks (rectangles) and terminating blocks (ellipses)]
The k-nearest neighbors algorithm can handle many classification tasks, but its biggest drawback is that it cannot reveal the inherent meaning of the data. The main advantage of decision trees is that their data representation is very easy to understand. Decision trees also require very little data preparation; in particular, they do not need feature scaling or centering at all.

The general workflow for building a decision tree is:

(1) Collect data: any method may be used.
(2) Prepare data: the tree-construction algorithm only works on nominal data, so numeric data must be discretized (see the sketch after this list).
(3) Analyze data: any method may be used; once the tree has been built, we should check whether the resulting graph matches expectations.
(4) Train the algorithm: construct the tree's data structure.
(5) Test the algorithm: use the learned tree to compute the error rate.
(6) Use the algorithm: this step applies to any supervised learning algorithm; using a decision tree makes the inherent meaning of the data easier to understand.
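As a small illustration of step (2), here is a hedged sketch of discretizing a continuous feature into nominal bins; NumPy is assumed, and the cut points are purely illustrative:

```python
import numpy as np

# Continuous petal lengths (cm) to be turned into nominal categories.
petal_length = np.array([1.4, 4.7, 5.1, 1.3, 6.0])

# Illustrative cut points: below 2.45 -> bin 0, 2.45-4.95 -> bin 1, above -> bin 2.
bins = [2.45, 4.95]
categories = np.digitize(petal_length, bins)
print(categories)   # [0 1 2 0 2]
```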

Make predictions

Taking iris classification as an example (there are four features and three classes), the decision boundaries of the decision tree are shown in the figure. The bold line is the decision boundary of the root node (depth 0): petal length = 2.45 cm. Because the area on its left is pure (only Iris setosa), it cannot be split further. The area on the right is impure, so the right-hand node at depth 1 splits again at petal width = 1.75 cm (dashed line). Because the maximum depth max_depth is set to 2, the decision tree stops there. If max_depth were set to 3, the two nodes at depth 2 would each add another decision boundary (dotted lines).
[Figure: decision boundaries of the iris decision tree]
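Below is a minimal sketch of how such a tree can be trained with Scikit-Learn. It follows the book's iris example, but the code itself is an illustration added here, not the article's own; export_text is used only to print the learned splits.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]          # keep only petal length and petal width
y = iris.target

# Depth-limited tree, as described above (max_depth=2).
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# Print the learned splits; the root should test petal length against ~2.45 cm.
print(export_text(tree_clf, feature_names=["petal length (cm)", "petal width (cm)"]))

# Estimated class probabilities for a flower with 5 cm long, 1.5 cm wide petals.
print(tree_clf.predict_proba([[5.0, 1.5]]))
```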

Decision tree classification

1. CLS algorithm

The CLS (Concept Learning System) algorithm was proposed by E. B. Hunt and others in 1966. It was the first to propose using decision trees for concept learning, and many later decision tree learning algorithms can be regarded as improvements and refinements of CLS. The main idea of CLS is to start from an empty decision tree and improve it by adding new decision nodes until the tree classifies the training examples correctly. Constructing the decision tree is also a process of specializing the hypothesis, so CLS can be regarded as a learning algorithm with a single operator, which can be expressed as: specialize the current hypothesis by adding a new decision condition (a new decision node). CLS calls this operator recursively on each leaf node to construct the decision tree.

2. ID3 algorithm

The ID3 (Iterative Dichotomiser 3) algorithm was proposed by Quinlan in 1986. It is the representative decision tree algorithm, and most later decision tree algorithms are improvements on it. It uses a divide-and-conquer strategy: at every level of the tree, information gain is used as the attribute selection criterion, so that the test at each non-leaf node extracts the most class information about the records being tested. Concretely: evaluate all attributes, select the attribute with the largest information gain to create a decision tree node, create a branch for each value of that attribute, and then recursively apply the same method to the subset in each branch, until every subset contains only data of a single class. The result is a decision tree that can classify new samples.
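A small hedged sketch of this selection criterion in plain Python; the tiny weather-style dataset and attribute names are illustrative additions, not from the original:

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    """Label entropy minus the weighted entropy after splitting on one attribute."""
    labels = [row[-1] for row in rows]
    gain = entropy(labels)
    for value in {row[attr_index] for row in rows}:
        subset = [row[-1] for row in rows if row[attr_index] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

# columns: outlook, windy, play?
data = [
    ["sunny", "no",  "no"],
    ["sunny", "yes", "no"],
    ["overcast", "no", "yes"],
    ["rainy", "no",  "yes"],
    ["rainy", "yes", "no"],
]
print(information_gain(data, 0), information_gain(data, 1))  # ID3 picks the larger one
```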

3. C4.5 algorithm

The C4.5 algorithm was proposed by J. R. Quinlan in 1993. It evolved from the ID3 algorithm and inherits its advantages, while introducing new methods and capabilities:

  • Uses the information gain ratio, which overcomes the bias toward multi-valued attributes that arises when selecting attributes by information gain;
  • Pruning during tree construction to avoid overfitting the tree;
  • Ability to discretize continuous attributes;
  • Can handle training sample sets with missing attribute values;
  • Ability to process incomplete data;
  • K-fold cross-validation;

The C4.5 algorithm reduces computational complexity and improves efficiency. Its key improvement over ID3 is to select attributes using the information gain ratio. Theory and experiments show that the information gain ratio works better than information gain, mainly because it overcomes ID3's bias toward attributes with many values. C4.5 also handles continuous-valued attributes, making up for ID3's limitation of handling only discrete-valued attributes.
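A small hedged sketch of the information gain ratio in plain Python; the toy dataset with an id-like attribute is an illustration added here to show why the ratio penalizes many-valued attributes:

```python
from collections import Counter
from math import log2

def entropy(values):
    total = len(values)
    return -sum(c / total * log2(c / total) for c in Counter(values).values())

def gain_ratio(rows, attr_index):
    """C4.5 criterion: information gain divided by the split information of the attribute."""
    labels = [row[-1] for row in rows]
    attr_values = [row[attr_index] for row in rows]
    gain = entropy(labels)
    for value in set(attr_values):
        subset = [row[-1] for row in rows if row[attr_index] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    split_info = entropy(attr_values)       # penalizes attributes with many distinct values
    return gain / split_info if split_info else 0.0

# columns: id-like attribute (unique per row), windy, play?
data = [["a", "no", "no"], ["b", "yes", "no"], ["c", "no", "yes"],
        ["d", "no", "yes"], ["e", "yes", "no"]]
print(gain_ratio(data, 0), gain_ratio(data, 1))  # the many-valued id attribute is penalized
```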
However, C4.5 pays a high price for the linear search over candidate thresholds of continuous test attributes. In 2002, Salvatore Ruggieri proposed an improved version of C4.5, the EC4.5 algorithm, which replaces the linear search with binary search to overcome this shortcoming. Experiments show that, when generating the same decision tree, EC4.5 can be up to five times more efficient than C4.5, at the cost of using more memory.

4. SLIQ algorithm

The algorithms above require the training sample set to reside in memory, so they are not suitable for large-scale data. To address this, IBM researchers proposed SLIQ (Supervised Learning In Quest) in 1996, a faster, scalable decision tree classification algorithm suited to larger-scale data. It builds the tree using attribute lists, a class list, and class histograms. Each attribute list contains two fields: the attribute value and the sample index. The class list likewise contains two fields: the class label and a reference to the corresponding leaf node.

5. CART training algorithm

Scikit-Learn uses the Classification and Regression Tree (CART) algorithm to train decision trees (also called "growing" trees). The idea is simple: first, split the training set into two subsets using a single feature k and a threshold tk (for example, petal length ≤ 2.45 cm). How does it choose k and tk? It searches for the pair (k, tk) that produces the purest subsets, weighted by their size.
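For reference, the classification cost function that CART minimizes when searching for $(k, t_k)$ can be written as follows (this is the standard form given in the book this article follows; $G$ is the impurity, e.g. Gini, and $m$ the number of instances in each subset):

$$J(k, t_k) = \frac{m_{\text{left}}}{m} G_{\text{left}} + \frac{m_{\text{right}}}{m} G_{\text{right}}$$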
Once it has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets, and so on, recursively. It stops once it reaches the maximum depth (controlled by the max_depth hyperparameter; reducing max_depth regularizes the model and reduces the risk of overfitting), or when it can no longer find a split that reduces impurity.

Computational complexity

Making a prediction requires traversing the decision tree from the root to a leaf. Decision trees are generally roughly balanced, so traversing one means visiting about O(log2(m)) nodes (note: log2 is the base-2 logarithm, log2(m) = log(m)/log(2)). Since each node only checks one feature value, the overall prediction complexity is O(log2(m)), independent of the number of features. Predictions are therefore fast even on large datasets. During training, however, the algorithm compares all features (or fewer, if max_features is set) on all samples at each node, giving a training complexity of O(n × m log(m)). For small training sets (up to a few thousand instances), Scikit-Learn can speed up training by presorting the data (set presort=True), but for larger training sets this may slow training down.
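As a rough, hedged illustration of the depth claim (the synthetic random-label dataset below is an assumption made for the example; get_depth() reports the depth of the fitted tree):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Random labels force an unconstrained tree to memorize the training set,
# growing until (almost) every leaf holds a single instance.
rng = np.random.RandomState(42)
m = 10_000
X = rng.rand(m, 2)
y = rng.randint(0, 2, m)

tree = DecisionTreeClassifier(random_state=42).fit(X, y)
print(np.log2(m))        # about 13.3: the depth of a perfectly balanced tree
print(tree.get_depth())  # larger in practice, since the tree is not perfectly balanced
```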

Regularization hyperparameters

Decision trees make very few assumptions about the training data (unlike, say, linear models, which assume the data is linear). If left unconstrained, the tree structure adapts to the training set, fits it very closely, and most likely overfits. Such a model is usually called a nonparametric model. This does not mean it has no parameters (in fact it usually has many); rather, the number of parameters is not determined before training, so the model structure is free to stick closely to the data. In contrast, a parametric model, such as a linear model, has a predetermined number of parameters, so its degrees of freedom are limited, which reduces the risk of overfitting (but increases the risk of underfitting). The model can be regularized as follows.

1. Limit branches and leaves

To avoid overfitting, the decision tree's degrees of freedom must be restricted during training; this process is called regularization. Which regularization hyperparameters are available depends on the model, but in general you can at least limit the maximum depth of the decision tree through the max_depth hyperparameter (the default is None, meaning unlimited); reducing max_depth regularizes the model and reduces the risk of overfitting. The DecisionTreeClassifier class has a few other parameters that restrict the shape of the tree in a similar way (a short sketch follows the list below):

  • min_samples_split (the minimum number of samples a node must have before splitting)
  • min_samples_leaf (the minimum number of samples a leaf node must have)
  • min_weight_fraction_leaf (same as min_samples_leaf, but expressed as a fraction of the total number of weighted instances)
  • max_leaf_nodes (maximum number of leaf nodes)
  • max_features (the maximum number of features evaluated for splitting at each node)

Increasing the min_* hyperparameters or decreasing the max_* hyperparameters will regularize the model.
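A minimal sketch of passing these hyperparameters to DecisionTreeClassifier (the moons dataset and the value min_samples_leaf=8 are illustrative choices, not from the original):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class dataset, with a held-out split for a quick generalization check.
X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Unconstrained tree: fits the training set very closely and tends to overfit.
free_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Regularized tree: min_samples_leaf keeps leaves from becoming too small.
reg_tree = DecisionTreeClassifier(min_samples_leaf=8, random_state=42).fit(X_train, y_train)

print(free_tree.score(X_test, y_test))  # typically the lower test accuracy
print(reg_tree.score(X_test, y_test))   # typically the higher test accuracy
```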

2. Pruning

Another approach is to first train the model without constraints and then prune (delete) unnecessary nodes. A node whose children are all leaf nodes is considered unnecessary unless the purity improvement it provides is statistically significant. Standard statistical tests, such as the χ² test, are used to estimate the probability that the improvement is purely due to chance (the null hypothesis). If this probability (the p-value) is higher than a given threshold (usually 5%, controlled by a hyperparameter), the node is considered unnecessary and its children are deleted. Pruning continues until all unnecessary nodes have been removed.
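Scikit-Learn does not implement χ²-based pruning; its closest built-in mechanism is minimal cost-complexity post-pruning via ccp_alpha. The sketch below uses that mechanism instead of the χ² test, with an illustrative dataset (in practice the value of alpha should be chosen on a validation set or by cross-validation):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Candidate pruning strengths for a tree grown without constraints.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    alpha = max(alpha, 0.0)   # guard against tiny negative values from floating point
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    score = pruned.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)
```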

In the figure, the tree on the left is trained without restrictions and overfits; the tree on the right is restricted by min_samples_leaf and generalizes better.
[Figure: an unrestricted tree (overfitting, left) vs. a tree regularized with min_samples_leaf (right)]

Regression

Decision trees can also perform regression tasks. Unlike classification, regression predicts a value rather than a class.
For example, to make a prediction for a new instance with $x_1 = 0.6$, you traverse the tree from the root node and eventually reach a leaf node with a predicted value of 0.1106. This prediction is simply the average target value of the 110 training instances associated with that leaf node; over these 110 instances, the prediction gives a mean squared error (MSE) of 0.0151.
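A hedged sketch of such a regression tree with Scikit-Learn; the noisy quadratic dataset below merely imitates the book's example, so the exact leaf values (such as 0.1106) will differ:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy quadratic dataset, in the spirit of the book's regression example.
rng = np.random.RandomState(42)
X = rng.rand(200, 1)
y = 4 * (X[:, 0] - 0.5) ** 2 + rng.randn(200) / 10

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X, y)

# Each leaf predicts the mean target value of its training instances.
print(tree_reg.predict([[0.6]]))
```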
[Figure: predictions of a regression decision tree]
Like classification tasks, decision trees are also prone to overfitting when dealing with regression tasks.
[Figure: an unregularized regression tree (overfitting) vs. a regularized one]

Questions

1. If the training set has 1 million instances, what is the approximate depth of the trained decision tree (unconstrained)?
Answer: The depth of a well-balanced binary tree containing m leaf nodes is log2(m) (note: log2 is the base-2 logarithm, log2(m) = log(m)/log(2)). A binary decision tree generally ends up more or less balanced by the end of training, and if left unconstrained it averages one training instance per leaf node. Therefore, if the training set contains one million instances, the depth of the decision tree is approximately log2(10^6) ≈ 20 (in practice a bit more, since the tree is usually not perfectly balanced).

2. If the decision tree overfits the training set, is it a good idea to reduce max_depth?
Answer: Reducing max_depth may be a good idea, because it constrains the model and regularizes it.

3. If the decision tree does not fit the training set well, is it a good idea to try to scale the input features?
Answer: One of the advantages of decision trees is that they do not care whether the training data is scaled or centered, so if a decision tree underfits the training set, scaling the input features is just a waste of time.

4. If it takes an hour to train a decision tree on a training set containing 1 million instances, how long does it take to train a decision tree on a training set containing 10 million instances?
Answer: The training complexity of a decision tree is O(n × m log(m)). Therefore, if the training set size is multiplied by 10, the training time is multiplied by K = (n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m). If m = 10^6, then K ≈ 11.7, so training on 10 million instances takes roughly 11.7 hours.
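A quick numeric check of this estimate in plain Python (the values follow the question):

```python
from math import log

m = 10**6
# Ratio of n * 10m * log(10m) to n * m * log(m); the factor n cancels out.
K = 10 * log(10 * m) / log(m)
print(K)   # about 11.67
```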

Note

Since I am a beginner and still do not know the algorithm parameters very well, much of this article is borrowed from excerpts that I have merged, so I have not listed the individual sources here. I did not include much code, mainly because I felt it would take up too much space, and I also did not include much derivation of the formulas.
In addition, this article mainly draws on the book "Practical Machine Learning: Based on Scikit-Learn and TensorFlow". If there is interest (I have a lot more material on Python and machine learning), you can follow me and send me a private message, and I will share it.
After all, I am a novice; if there are mistakes, I hope for guidance and encouragement rather than flames, and I hope to find like-minded friends to write and learn with.



Source: blog.csdn.net/weixin_45755332/article/details/105523094