Decision tree pruning: solving model overfitting [decision tree, machine learning]

How to solve the overfitting problem of decision trees through pruning

Decision trees are a powerful machine learning algorithm used to solve classification and regression problems. A decision tree model makes predictions through tree-structured decision rules, but when building a decision tree, overfitting often occurs: the model performs well on the training data but poorly on unseen data.

The threat of overfitting

In machine learning, overfitting is a common problem: a model performs well on the training data but generalizes poorly to unseen data. Decision trees are especially prone to it because they strive to fit every training sample as accurately as possible, producing an overly complex tree that captures noise and random variation in the training set rather than just the true data patterns.
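The effect is easy to reproduce: a fully grown tree typically scores near-perfectly on its own training set while lagging on held-out data. A minimal sketch on synthetic data (the dataset and split here are illustrative assumptions, not from the original text):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic classification data (flip_y injects label noise)
X, y = make_classification(n_samples=1000, n_features=20,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unrestricted tree memorizes the training set
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # typically ~1.0
print("test accuracy:", tree.score(X_test, y_test))     # noticeably lower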

Decision tree pruning: rescuing the model from overfitting

Decision tree pruning is a technique that reduces the complexity of a decision tree and helps prevent overfitting to the training data. The goal of pruning is to remove some branches (or decision rules) of the tree in order to reduce its depth and complexity, thereby improving the model's generalization ability. In short, pruning trades a perfect fit to specific situations in the training data for broader applicability of the model.

1. Pre-pruning

Pre-pruning intervenes before nodes are split during decision tree construction, preventing the tree from becoming too complex. Common pre-pruning conditions include setting a maximum depth, a minimum number of samples per leaf node, or a minimum number of samples required to split a node. With these constraints, unnecessary branches are avoided as the tree grows, reducing the risk of overfitting.

Example: on a dataset from a dating website, we use a decision tree to predict whether a user will initiate a second date. Pre-pruning can limit the depth of the decision tree and ensure that too many branches are not generated for tiny data subsets, improving the generalization ability of the model.

from sklearn.tree import DecisionTreeClassifier

# Create a decision tree classifier with a maximum depth of 5
tree_classifier = DecisionTreeClassifier(max_depth=5)

# Train the model
tree_classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = tree_classifier.predict(X_test)
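max_depth is just one of the constraints mentioned above; scikit-learn exposes the other pre-pruning conditions as parameters as well. A sketch combining them (the specific values are illustrative assumptions):

from sklearn.tree import DecisionTreeClassifier

# Combine several pre-pruning constraints:
# - max_depth caps how deep the tree may grow
# - min_samples_split refuses to split subsets smaller than 20 samples
# - min_samples_leaf forbids leaves with fewer than 10 samples
tree_classifier = DecisionTreeClassifier(max_depth=5,
                                         min_samples_split=20,
                                         min_samples_leaf=10)
tree_classifier.fit(X_train, y_train)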

2. Post-pruning

Post-pruning reduces the complexity of the tree by removing unnecessary branches after the complete decision tree has been built. It first grows a full decision tree, then selects branches to prune by measuring branch impurity (such as Gini impurity or entropy) and comparing the performance of candidate pruned trees. Although this method is more computationally expensive, it often achieves more accurate pruning.

Example: in medical diagnostics, we use a decision tree to predict whether a patient has a specific disease. Post-pruning can remove branches that contribute little to the final diagnosis, making the model easier to understand and interpret.

from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import _tree

def prune_index(inner_tree, index, threshold):
    # If the smallest per-class value stored in this node falls below
    # the threshold, collapse the subtree into a leaf by unlinking
    # its children
    if inner_tree.value[index].min() < threshold:
        inner_tree.children_left[index] = _tree.TREE_LEAF
        inner_tree.children_right[index] = _tree.TREE_LEAF
    # Otherwise recurse into both children so the whole tree is visited
    elif inner_tree.children_left[index] != _tree.TREE_LEAF:
        prune_index(inner_tree, inner_tree.children_left[index], threshold)
        prune_index(inner_tree, inner_tree.children_right[index], threshold)

# Create a decision tree classifier and train the full tree
tree_classifier = DecisionTreeClassifier()
tree_classifier.fit(X_train, y_train)

# Pruning threshold (note: this snippet edits sklearn's internal tree
# arrays directly, an illustrative hack rather than a supported API)
prune_threshold = 0.01

# Post-pruning, starting from the root node (index 0)
prune_index(tree_classifier.tree_, 0, prune_threshold)

# Predict on the test set
y_pred = tree_classifier.predict(X_test)
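It is worth noting that scikit-learn also ships a supported post-pruning mechanism, minimal cost-complexity pruning, which avoids touching tree internals. A minimal sketch, assuming the same X_train, y_train, and X_test as above:

from sklearn.tree import DecisionTreeClassifier

# Compute the effective alphas along the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit with one alpha from the path; larger alphas prune more
# aggressively. In practice, alpha is chosen by cross-validation.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # illustrative choice
pruned_classifier = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
pruned_classifier.fit(X_train, y_train)
y_pred = pruned_classifier.predict(X_test)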

Differences and Summary

Both pre-pruning and post-pruning can address the overfitting problem of decision trees, but they differ in how and when they act:

  • Pre-pruning acts during the construction of the decision tree: by blocking unnecessary branches as the tree grows, it limits complexity up front.

  • Post-pruning is performed after the complete decision tree has been built, reducing its complexity by removing unnecessary branches. It usually requires computing impurity and comparing the performance of different pruning schemes, as the sketch below illustrates.
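A minimal sketch comparing the size of a fully grown tree against pre- and post-pruned versions (the max_depth and ccp_alpha values are illustrative assumptions, and X_train and y_train are assumed from the examples above):

from sklearn.tree import DecisionTreeClassifier

# Fit three variants on the same training data
full_tree = DecisionTreeClassifier().fit(X_train, y_train)
pre_pruned = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_train, y_train)

# Both pruned trees should be far smaller than the fully grown one
for name, clf in [("full", full_tree), ("pre-pruned", pre_pruned),
                  ("post-pruned", post_pruned)]:
    print(f"{name}: {clf.tree_.node_count} nodes, depth {clf.get_depth()}")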
