"Machine Learning in Practice: Based on Scikit-Learn, Keras and TensorFlow 2nd Edition" - Study Notes (6): Decision Tree

· Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron (O'Reilly). Copyright 2019 Aurélien Géron, 978-1-492-03264-9.
· Environment: Anaconda (Python 3.8) + PyCharm
· Study time: 2022.05.04~2022.05.05

Chapter 6 Decision Trees

Like SVMs, decision trees are versatile machine learning algorithms that can perform classification and regression tasks, and even multi-output tasks. They are powerful algorithms capable of fitting complex datasets. For example, in Chapter 2 you trained a DecisionTreeRegressor model on the California housing dataset and it fit the data perfectly (actually, it overfit it).

Decision trees are also a fundamental building block of random forests (see Chapter 7), and they are one of the most powerful machine learning algorithms available today.

In this chapter, we'll start by discussing how to use decision trees for training, visualization, and making predictions. Then, we'll look at the CART training algorithm used by Scikit-Learn, and we'll discuss how to regularize trees and use them for regression tasks. Finally, we will discuss some limitations of decision trees.

6.1 Training and Visualizing Decision Trees

To understand decision trees, let's build a decision tree and see how it makes predictions. The following code trains a DecisionTreeClassifier on the iris dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)

To visualize the decision tree, first use the export_graphviz() method to output a graph definition file named iris_tree.dot:

import os
from sklearn.tree import export_graphviz

export_graphviz(
    tree_clf,
    out_file=os.path.join("iris_tree.dot"),
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)

You can then convert this .dot file to a variety of formats, such as PDF or PNG, using the dot command-line tool in the Graphviz package. This command line converts .dot files to .png image files:

$ dot -Tpng iris_tree.dot -o iris_tree.png

Alternatively, in PyCharm you can simply install a .dot plugin to preview the file directly.
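If you prefer to stay entirely in Python, you can also skip Graphviz and render the tree with Matplotlib. A minimal sketch, assuming Matplotlib is installed, using plot_tree() (available since Scikit-Learn 0.21):

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Render the fitted tree directly with Matplotlib instead of Graphviz.
plt.figure(figsize=(8, 6))
plot_tree(tree_clf,
          feature_names=iris.feature_names[2:],
          class_names=iris.target_names,
          rounded=True,
          filled=True)
plt.show()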

The resulting tree is shown below:

[Figure: the trained iris decision tree]

6.2 Making predictions

Let's see how the tree shown above makes predictions. Suppose you find an iris flower and want to classify it. You start at the root node (depth 0, at the top): this node asks whether the flower's petal length is smaller than 2.45 cm. If it is, you move down to the root's left child node (depth 1, left). In this case, it is a leaf node (i.e., it does not have any child nodes), so it does not ask any questions: simply look at that node's predicted class, and the decision tree predicts that your flower is an Iris setosa (class=setosa).

Now suppose you find another flower, and this time its petal length is greater than 2.45 cm. You must move down to the root's right child node (depth 1, right). This child is not a leaf node, so it asks another question: is the petal width smaller than 1.75 cm? If it is, then your flower is most likely an Iris versicolor (depth 2, left). If not, it is likely an Iris virginica (depth 2, right). It's really that simple.

One of the many qualities of decision trees is that they require very little data preparation. In fact, they don't require feature scaling or centering at all.

The node's samples property counts the number of training instances it applies to. For example, there are 100 training instances with petals longer than 2.45cm (depth 1, right) and 54 of them with petal widths less than 1.75cm (depth 2, left).

The value attribute of a node tells you how many training instances of each class this node applies to. For example, the bottom-right node applies to 0 Iris setosa, 1 Iris versicolor, and 45 Iris virginica instances.

Finally, a node's gini attribute measures its impurity: a node is "pure" (gini=0) if all the training instances it applies to belong to the same class. For example, the depth-1 left node applies only to Iris setosa training instances, so it is pure and its gini value is 0. Equation 6-1 shows how the Gini impurity $G_i$ of the $i$-th node is computed: $G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2$, where $p_{i,k}$ is the ratio of class-$k$ instances among the training instances in the $i$-th node. For example, the depth-2 left node has a Gini impurity equal to $1 - (0/54)^2 - (49/54)^2 - (5/54)^2 \approx 0.168$.
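To make the formula concrete, here is a small sketch (not from the book) that recomputes the Gini impurity of every node of the tree_clf trained above from its stored per-class totals (counts in most Scikit-Learn versions, proportions in recent ones; the normalization below works either way):

import numpy as np

# tree_clf.tree_.value has shape [n_nodes, 1, n_classes] for a single-output classifier.
totals = tree_clf.tree_.value[:, 0, :]
p = totals / totals.sum(axis=1, keepdims=True)   # p_{i,k} for each node i
gini = 1 - (p ** 2).sum(axis=1)                  # Equation 6-1 applied to every node
print(gini)  # should match the gini values shown in the tree diagram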

Scikit-Learn uses the CART algorithm, which produces only binary trees: non-leaf nodes always have exactly two children (i.e., questions only have yes/no answers). However, other algorithms, such as ID3, can produce decision trees with nodes that have more than two children.

Model Interpretation: White Box vs Black Box

Decision trees are intuitive, and their decisions are easy to interpret. Such models are often called white-box models. In contrast, as we will see, random forests and neural networks are generally considered black-box models. They make great predictions, and you can easily check the calculations they performed to reach those predictions; nevertheless, it is usually hard to explain in simple terms why the predictions were made. For example, if a neural network says that a particular person appears in a picture, it is hard to know what actually contributed to that prediction: did the model recognize that person's eyes, mouth, nose, shoes, or even the couch they were sitting on? In contrast, decision trees provide nice, simple classification rules that can even be applied manually if need be (for example, for flower classification).

6.3 Estimating Class Probabilities

A decision tree can also estimate the probability that an instance belongs to a particular class k: first it traverses the tree to find the leaf node for this instance, and then it returns the ratio of training instances of class k in that node. For example, suppose you have found a flower whose petals are 5 cm long and 1.5 cm wide. The corresponding leaf node is the depth-2 left node, so the decision tree outputs the following probabilities: Iris setosa, 0% (0/54); Iris versicolor, 90.7% (49/54); Iris virginica, 9.3% (5/54). And if you ask it to predict the class, it outputs Iris versicolor (class 1), since that class has the highest probability. Let's check this:

tree_clf.predict_proba([[5, 1.5]])
# Output: array([[0. , 0.90740741, 0.09259259]])
tree_clf.predict([[5, 1.5]])
# Output: array([1])

Perfect! Note that the estimated probabilities are identical anywhere in the bottom-right rectangle of Figure 6-2, for example, for petals 6 cm long and 1.5 cm wide (even though it seems obvious that such a flower would most likely be an Iris virginica).

6.4 CART training algorithm

Scikit-Learn uses the Classification and Regression Tree (CART) algorithm to train decision trees (also called "growing" trees). The algorithm works by first splitting the training set into two subsets using a single feature k and a threshold t_k (e.g., "petal length ≤ 2.45 cm"). How does it choose k and t_k? It searches for the pair (k, t_k) that produces the purest subsets, weighted by their size.

The formula below gives the cost function that the algorithm tries to minimize, where $G_{left/right}$ measures the impurity of the left/right subset and $m_{left/right}$ is the number of instances in the left/right subset:

$$J(k, t_k) = \frac{m_{left}}{m} G_{left} + \frac{m_{right}}{m} G_{right}$$
Once the CART algorithm successfully splits the training set into two parts, it uses the same logic to split the subset, then the subset, and so on. It stops recursion once the maximum depth (defined by the hyperparameter max_depth) is reached, or no splits that reduce the impurity are found. Some other hyperparameters (described later) control some other stopping conditions (min_samples_split, min_samples_leaf, min_weight_fraction_leaf, and max_leaf_nodes).
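To make the split search concrete, here is an educational brute-force sketch (not Scikit-Learn's actual optimized implementation) that evaluates J(k, t_k) for every feature and candidate threshold on the iris petal data loaded earlier:

import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels (Equation 6-1)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Exhaustive search for the (feature k, threshold t_k) pair that
    # minimizes the weighted impurity J(k, t_k) defined above.
    m, n = X.shape
    best_k, best_t, best_J = None, None, float("inf")
    for k in range(n):
        for t in np.unique(X[:, k]):
            left, right = y[X[:, k] <= t], y[X[:, k] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            J = len(left) / m * gini(left) + len(right) / m * gini(right)
            if J < best_J:
                best_k, best_t, best_J = k, t, J
    return best_k, best_t, best_J

# On the iris petal features this recovers a first split on petal length
# (Scikit-Learn places the threshold midway between observed values, e.g. 2.45 cm,
# while this sketch returns an observed value, so the exact number differs slightly).
print(best_split(X, y))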

As you can see, CART is a greedy algorithm: it greedily searches for an optimum split at the top level, then repeats the process at each subsequent level. It does not check whether or not the split will lead to the lowest possible impurity several levels down. A greedy algorithm often produces a reasonably good solution, but it is not guaranteed to be the optimal solution.

Unfortunately, finding the optimal tree is known to be an NP-complete problem: it requires $O(\exp(m))$ time, making the problem intractable even for small training sets. That is why we must settle for a "reasonably good" solution.

NP stands for non-deterministic polynomial time: the class of problems whose solutions can be verified in polynomial time. A problem is NP-complete (NPC) if it is in NP and every problem in NP can be reduced to it in polynomial time. Whether P = NP is one of the seven Millennium Prize Problems and is regarded as one of the most prominent open questions in logic and computer science.

6.5 Computational complexity

Making predictions requires traversing the decision tree from the root to a leaf. Decision trees are generally approximately balanced, so traversing the tree requires going through roughly $O(\log_2(m))$ nodes (note: $\log_2$ is the binary logarithm, $\log_2(m) = \log(m)/\log(2)$). Since each node only requires checking the value of one feature, the overall prediction complexity is $O(\log_2(m))$, independent of the number of features. So predictions are very fast, even when dealing with large training sets.

The training algorithm compares all features (or fewer if max_features is set) on all samples at each node. Comparing all features on all samples at each node results in a training complexity of $O(n \times m \log_2(m))$. For small training sets (less than a few thousand instances), Scikit-Learn can speed up training by presorting the data (set presort=True), but doing so slows down training considerably for larger training sets.

6.6 Gini Impurity or Entropy

By default, the Gini impurity measure is used, but you can select entropy as the impurity measure instead by setting the criterion hyperparameter to "entropy". The concept of entropy originated in thermodynamics as a measure of molecular disorder: entropy approaches zero when molecules are still and well ordered. The concept later spread to a wide variety of domains, including Shannon's information theory, where it measures the average information content of a message [1]: entropy is zero when all messages are identical. In machine learning, entropy is frequently used as an impurity measure: a set's entropy is zero when it contains instances of only one class.
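For reference, the entropy of the $i$-th node is defined as (the book's Equation 6-3):

$$H_i = -\sum_{\substack{k=1 \\ p_{i,k} \neq 0}}^{n} p_{i,k} \log_2(p_{i,k})$$

For example, the depth-2 left node of the tree above has an entropy of $-(49/54)\log_2(49/54) - (5/54)\log_2(5/54) \approx 0.445$.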

Types of decision trees and their implementations:

[Figure: table of decision-tree algorithm types and their implementations]

The corresponding Scikit-Learn code is shown below. Note that Scikit-Learn always uses an optimized CART algorithm that builds binary trees; setting criterion='entropy' only swaps the impurity measure, it does not produce a true ID3 tree:

from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()  # CART classification tree (criterion="gini" by default; CART = Classification and Regression Tree)
dtree = DecisionTreeClassifier(criterion='entropy')  # entropy criterion, in the spirit of ID3 (Iterative Dichotomiser 3)
# C4.5 converts the trained tree (i.e., the output of the ID3 algorithm) into sets of if-then rules.
# The accuracy of each rule is then evaluated to determine the order in which they should be applied.
# Pruning is done by removing a rule's precondition if the rule's accuracy improves without it.

So, should you use Gini impurity or entropy? The truth is, most of the time it does not make a big difference: they lead to similar trees. Gini impurity is slightly faster to compute, so it is a good default. However, when they do differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.

[Raileanu and Stoffel, 2004; see also Zhou Zhihua, Machine Learning] A theoretical analysis of information gain and the Gini index shows that they disagree in only about 2% of cases.

6.7 Regularization Hyperparameters

Decision trees make very few assumptions about the training data (as opposed to linear models, for example, which assume that the data is linear). If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely and most likely overfitting it. Such a model is often called a nonparametric model, not because it does not have any parameters (it often has a lot), but because the number of parameters is not determined prior to training, so the model structure is free to stick closely to the data. In contrast, a parametric model, such as a linear model, has a predetermined number of parameters, so its degrees of freedom are limited, reducing the risk of overfitting (but increasing the risk of underfitting).

To avoid overfitting, the degree of freedom of the decision tree needs to be reduced during training. By now you should know that this process is called regularization. The choice of regularization hyperparameters depends on the model used, but in general, at least the maximum depth of the decision tree can be limited. In Scikit-Learn, this is controlled by the hyperparameter max_depth (default is None, meaning unlimited). Decreasing max_depth regularizes the model, reducing the risk of overfitting.

⭐The DecisionTreeClassifier class has a few other parameters that similarly restrict the shape of the decision tree: min_samples_split (the minimum number of samples a node must have before it can be split), min_samples_leaf (the minimum number of samples a leaf node must have), min_weight_fraction_leaf (same as min_samples_leaf but expressed as a fraction of the total number of weighted instances), max_leaf_nodes (the maximum number of leaf nodes), and max_features (the maximum number of features that are evaluated for splitting at each node). Increasing min_* hyperparameters or reducing max_* hyperparameters will regularize the model.
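As a quick sketch of how these hyperparameters are used (on a hypothetical moons dataset, not anything introduced above), compare an unconstrained tree with one regularized via min_samples_leaf:

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

Xm, ym = make_moons(n_samples=100, noise=0.25, random_state=42)

# Unconstrained tree: grows until every leaf is pure, very likely overfitting.
deep_tree_clf = DecisionTreeClassifier(random_state=42)
# Regularized tree: every leaf must cover at least 4 training instances.
regularized_tree_clf = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)

deep_tree_clf.fit(Xm, ym)
regularized_tree_clf.fit(Xm, ym)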

Another approach is to first train the model without restrictions, then prune (delete) unnecessary nodes. A node whose children are all leaf nodes is considered unnecessary if the purity improvement it provides is not statistically significant. Standard statistical tests, such as the χ² test, are used to estimate the probability that the improvement is purely the result of chance (this is called the null hypothesis). If this probability, called the p-value, is higher than a given threshold (typically 5%, controlled by a hyperparameter), then the node is considered unnecessary and its children are deleted. The pruning continues until all unnecessary nodes have been pruned.

The two basic pruning strategies for decision trees are "pre-pruning" and "post-pruning" [Quinlan, 1993]. Pre-pruning means that, while the tree is being grown, each node is evaluated before it is split; if splitting the current node would not improve the generalization performance of the tree, the split is not performed and the node is marked as a leaf. Post-pruning first grows a complete decision tree from the training set and then examines the non-leaf nodes from the bottom up; if replacing the subtree rooted at a node with a leaf node would improve the tree's generalization performance, that subtree is replaced with a leaf.

In general, post-pruned decision trees have a very small risk of underfitting, and their generalization performance is often better than that of pre-pruned trees. However, post-pruning takes place only after a complete tree has been grown, and it must examine every non-leaf node in the tree from the bottom up, so its training time overhead is much larger than that of unpruned or pre-pruned decision trees.
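Scikit-Learn (0.22+) implements a form of post-pruning called minimal cost-complexity pruning (a different criterion from the χ² test described above, but serving the same purpose). A brief sketch on the iris data used earlier:

from sklearn.tree import DecisionTreeClassifier

# Grow an unrestricted tree, then inspect its pruning path.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)  # candidate pruning strengths, from weakest to strongest

# Larger ccp_alpha values prune more aggressively; in practice you would pick
# the value by cross-validation rather than hard-coding one as done here.
pruned_tree_clf = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=42)
pruned_tree_clf.fit(X, y)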

6.8 Regression

Decision trees are also capable of performing regression tasks. Let's build a regression tree using Scikit-Learn's DecisionTreeRegressor class and train it on a noisy quadratic dataset with max_depth=2:
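The code below assumes a noisy quadratic dataset; one way to generate it (a sketch along the lines of the book's companion notebook) is:

import numpy as np

np.random.seed(42)
m = 200
X = np.random.rand(m, 1)            # a single input feature in [0, 1)
y = 4 * (X - 0.5) ** 2              # quadratic target...
y = y + np.random.randn(m, 1) / 10  # ...plus Gaussian noise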

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)

The resulting tree looks like this:

This tree looks very similar to the classification tree you built earlier. The main difference is that instead of predicting a class in each node, it predicts a value. For example, suppose you want to make a prediction for a new instance with x1=0.6. You traverse the tree starting at the root, and you eventually reach the leaf node that predicts value=0.111. This prediction is the average target value of the 110 training instances associated with that leaf node, and it results in a mean squared error (MSE) equal to 0.015 over these 110 instances.
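You can check that leaf's prediction directly (assuming the dataset sketch above; the exact value depends on the random noise and seed):

tree_reg.predict([[0.6]])
# should output approximately array([0.111])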

The left side of the figure below shows this model's predictions. If you set max_depth=3, you get the predictions shown on the right. Notice that the predicted value for each region is always the average target value of the instances in that region. The algorithm splits each region in a way that makes most training instances as close as possible to that predicted value.

The CART algorithm works in much the same way as previous methods, except that instead of trying to split the training set in a way that minimizes impurity, it splits the training set in a way that minimizes MSE.
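For reference, the cost function minimized for regression can be written as follows (shown here with the standard MSE normalization); it simply replaces the Gini term of the classification cost with the mean squared error of each subset:

$$J(k, t_k) = \frac{m_{left}}{m}\,\mathrm{MSE}_{left} + \frac{m_{right}}{m}\,\mathrm{MSE}_{right}, \qquad \mathrm{MSE}_{node} = \frac{1}{m_{node}} \sum_{i \in node} \left(\hat{y}_{node} - y^{(i)}\right)^2, \qquad \hat{y}_{node} = \frac{1}{m_{node}} \sum_{i \in node} y^{(i)}$$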

Just like for classification tasks, decision trees are prone to overfitting when dealing with regression tasks. Without any regularization (i.e., using the default hyperparameters), you get the predictions on the left side of the figure below, which obviously overfit the training set badly. Just setting min_samples_leaf=10 results in a much more reasonable model, shown on the right side of the figure below.

[Figure: regularizing a decision tree regressor (left: no restrictions; right: min_samples_leaf=10)]
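A quick sketch of that regularization step, using the quadratic dataset from above:

# An unregularized regressor vs. one constrained by min_samples_leaf.
tree_reg_default = DecisionTreeRegressor()                     # default hyperparameters, overfits the noise
tree_reg_smooth = DecisionTreeRegressor(min_samples_leaf=10)   # each leaf must cover >= 10 instances
tree_reg_default.fit(X, y)
tree_reg_smooth.fit(X, y)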

6.9 Instability

Hopefully by now you are convinced that decision trees have a lot going for them: they are easy to understand and interpret, easy to use, versatile, and powerful. However, they do have a few limitations. First, as you may have noticed, decision trees love orthogonal decision boundaries (all splits are perpendicular to an axis), which makes them sensitive to training set rotation. For example, the figure below shows a simple linearly separable dataset: on the left, a decision tree can split it easily, while on the right, after the dataset is rotated by 45 degrees, the decision boundary looks unnecessarily convoluted. Although both decision trees fit the training set perfectly, it is very likely that the model on the right will not generalize well. One way to limit this problem is to use principal component analysis (see Chapter 8), which often results in a better orientation of the training data.

[Figure: sensitivity to training set rotation]
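A sketch of the PCA idea mentioned above, applied to the iris petal features (PCA itself is covered in Chapter 8):

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Rotate the data onto its principal axes before fitting the tree, so the
# tree's axis-aligned splits line up better with the data's orientation.
pca_tree_clf = make_pipeline(PCA(n_components=2), DecisionTreeClassifier(max_depth=2))
pca_tree_clf.fit(iris.data[:, 2:], iris.target)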

More generally, the main issue with decision trees is that they are very sensitive to small variations in the training data. For example, if you remove the widest Iris versicolor from the iris training set (the one with petals 4.8 cm long and 1.8 cm wide) and then retrain a decision tree, you may get a model like the one shown below. It looks very different from the previous decision tree (Figure 6-2). In fact, since the training algorithm used by Scikit-Learn is stochastic, you may get very different models even on the same training data (unless you set the random_state hyperparameter).

[Figure: the decision tree retrained after removing the widest Iris versicolor]

Random forests can limit this instability by averaging predictions over many trees, as we will see in Chapter 7.

6.10 Practice questions

question

  1. What is the approximate depth of a decision tree trained (without restrictions) on a training set with 1 million instances?

  2. In general, is the Gini impurity of a child node higher or lower than its parent node? Is it usually higher/lower? Or is it always higher/lower?

  3. Is it a good idea to reduce max_depth if the decision tree is overfitting the training set?

  4. If the decision tree underfits the training set, is it a good idea to try scaling the input features?

  5. If it takes an hour to train a decision tree on a training set of 1 million instances, how long would it roughly take to train a decision tree on a training set of 10 million instances?

  6. Can setting presort=True speed up training if the training set contains 100k instances?

  7. Train and fine-tune a decision tree for the moons dataset.

    • a. Use make_moons(n_samples=10000, noise=0.4) to generate a moons dataset.
    • b. Use train_test_split() to split it into a training set and a test set.
    • c. Use grid search with cross-validation (with the help of GridSearchCV) to find good hyperparameter values for a DecisionTreeClassifier. Hint: try various values of max_leaf_nodes.
    • d. Train it on the full training set using these hyperparameters, and measure the model's performance on the test set. You should get roughly 85%-87% accuracy.
  8. Follow the steps below to grow a forest.

    • a. Continuing the previous exercise, produce 1000 subsets of the training set, each subset containing 100 randomly selected instances. Hint: Use Scikit-Learn's ShuffleSplit for this.
    • b. Train one decision tree on each subset, using the best hyperparameter values obtained earlier. Evaluate these 1,000 decision trees on the test set. Since they were trained on smaller sets, these decision trees will likely perform a bit worse than the first decision tree, achieving only about 80% accuracy.
    • c. The time has come to witness the miracle. For each test set instance, generate predictions for 1000 decision trees, then keep only the most frequent predictions (you can use SciPy's mode() function). This way you get predictions for the majority of votes on the test set.
    • d. Evaluate these predictions on the test set, you should get higher accuracy than the first model (0.5%-1.5% higher). Congratulations, you have trained a random forest classifier!

Answer

  1. The depth of a well-balanced binary tree containing m leaf nodes is equal to log₂(m) (note: log₂ is the binary logarithm; log₂(m) = log(m)/log(2)), rounded up. Generally speaking, a binary decision tree (one that makes only binary decisions, like all trees in Scikit-Learn) ends up more or less balanced at the end of training, and, if trained without restrictions, ends up with roughly one instance per leaf node. So if the training set contains 1 million instances, the decision tree will have a depth of about log₂(10⁶) ≈ 20 (actually a bit more, since decision trees are usually not perfectly balanced).

  2. A node's Gini impurity is generally lower than its parent's. This is due to the cost function of the CART training algorithm, which splits each node in a way that minimizes the weighted sum of its children's Gini impurities. However, it is possible for one child to have a higher Gini impurity than its parent, as long as this increase is more than compensated for by a decrease in the other child's impurity. For example, consider a node containing four instances of class A and one instance of class B; its Gini impurity is 1 − (1/5)² − (4/5)² = 0.32. Now assume the dataset is one-dimensional and the instances are lined up in the following order: A, B, A, A, A. You can verify that the algorithm will split this node after the second instance, producing one child node with instances A, B and the other child node with instances A, A, A. The first child node's Gini impurity is 1 − (1/2)² − (1/2)² = 0.5, which is higher than its parent's. This is compensated for by the fact that the other child node is pure, so the overall weighted Gini impurity is (2/5) × 0.5 + (3/5) × 0 = 0.2, lower than the parent's Gini impurity.

  3. If the decision tree is overfitting the training set, it may be a good idea to decrease max_depth, since this will constrain the model, regularizing it.

  4. One of the nice qualities of decision trees is that they don't care whether the training data is scaled or centered, so if a decision tree underfits the training set, scaling the input features is just a waste of time.

  5. The training complexity of a decision tree is O(n × m log(m)). So if you multiply the training set size by 10, the training time will be multiplied by K = (n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m). If m = 10⁶, then K ≈ 11.7, so training on 10 million instances will take roughly 11.7 hours.

  6. Presorting the training set speeds up training only if the dataset is smaller than a few thousand instances. If it contains 100,000 instances, setting presort=True will considerably slow down training.

For solutions to exercises 7 and 8, see the Jupyter notebooks at https://github.com/ageron/handson-ml2.
