Random Forest (Economics)

KNN: in high-dimensional spaces it is hard to find genuinely close neighbors (the curse of dimensionality); the sketch below illustrates this.
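
A minimal sketch of this effect, assuming only NumPy: as the dimension d grows, the nearest and farthest points from a random query become almost equally distant, so "nearest neighbor" carries little information.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of sample points

for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n, d))   # n points in the unit hypercube
    q = rng.uniform(size=d)        # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    # Relative contrast between farthest and nearest neighbor;
    # it shrinks toward 0 as d grows.
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:5d}  relative contrast = {contrast:.3f}")
```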

Prediction with a classification tree is simple: drop an observation down the tree (answering a series of yes/no questions) until it reaches a leaf, then predict the leaf's class by the majority-vote rule.

The CART algorithm builds a binary tree, which amounts to recursive partitioning of the feature space: each split cuts along a direction parallel to one variable's axis, so the space is divided into axis-aligned rectangular regions.
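
A minimal sketch of this in practice, assuming scikit-learn and its built-in iris data (DecisionTreeClassifier and export_text are scikit-learn's names, not from the original post): the printed rules make the axis-aligned splits visible, and predict drops observations down the tree to a majority-vote leaf.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each internal node tests one variable against a threshold, so every
# split is parallel to a coordinate axis and the leaves are rectangles.
print(export_text(tree, feature_names=iris.feature_names))

# Prediction: each observation follows the yes/no questions to a leaf,
# whose majority class is returned.
print(tree.predict(X[:5]))
```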

Node Impurity Functions

Classification tree

Which variable should be chosen as the split variable?

Goal: make the two child nodes produced by the split as pure as possible (equivalently, minimize their impurity).

Gini index: Gini = Σ_k p_k(1 − p_k) = 1 − Σ_k p_k², where p_k is the proportion of class k in the node.

Entropy: H = −Σ_k p_k log p_k. Both measures are 0 for a pure node and maximal for an evenly mixed node; a small implementation of both follows.
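
A minimal sketch of the two impurity functions, assuming NumPy; the helper names gini and entropy are illustrative, and p is the vector of class proportions in a node.

```python
import numpy as np

def gini(p):
    """Gini index: 1 - sum_k p_k^2 (0 for a pure node)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy: -sum_k p_k * log(p_k), with 0*log(0) taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # skip empty classes to avoid log(0)
    return -np.sum(p * np.log(p))

# Both are 0 for a pure node and maximal for an evenly mixed node:
print(gini([1.0, 0.0]), gini([0.5, 0.5]))        # 0.0, 0.5
print(entropy([1.0, 0.0]), entropy([0.5, 0.5]))  # 0.0, ~0.693
```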

Pruning

If splitting continues without limit, each leaf node eventually contains a single observation, which overfits (the tree fits the noise in the data rather than the signal).

Cost-complexity Pruning

Minimize the penalized criterion C_α(T) = (training error of tree T) + α·|T|, where |T| is the number of leaf nodes; the tuning parameter α is determined by cross-validation, as sketched below.
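
A minimal sketch of this with scikit-learn (ccp_alpha and cost_complexity_pruning_path are scikit-learn's API; the breast-cancer dataset is just an example): candidate α values come from the pruning path of the full tree, and cross-validation picks among them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas come from the pruning path of the unpruned tree;
# the last alpha prunes the tree down to the root, so drop it.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]

# Pick the penalty alpha by 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": alphas},
    cv=5,
)
search.fit(X, y)
print("best alpha:", search.best_params_["ccp_alpha"])
```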

The regression tree uses the squared-error minimization criterion: each split is chosen to minimize the residual sum of squares, as in least squares (illustrated below).
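
A minimal sketch of the criterion for a single variable, assuming NumPy; the helper best_split is illustrative. It scans candidate thresholds and keeps the one minimizing the combined residual sum of squares of the two children.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, rss) of the RSS-minimizing split on x."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.inf)
    for i in range(1, len(x)):
        left, right = y[:i], y[i:]
        # Each child predicts its mean; sum the two residual sums of squares.
        rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if rss < best[1]:
            best = ((x[i - 1] + x[i]) / 2, rss)
    return best

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x < 4, 1.0, 3.0) + rng.normal(0, 0.3, 200)  # step function + noise
print(best_split(x, y))  # recovers a threshold near 4
```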

When a decision tree partitions the space into regions, it takes the influence of x on y into account, so the partition is more intelligent than fixed neighborhoods; the tree behaves like an adaptive nearest-neighbor method.

Trees generalize easily to high-dimensional spaces and are largely unaffected by noise variables (irrelevant variables are simply not selected for splits).

Because recursive splitting only produces axis-aligned hyperrectangles, a tree may make large errors when the true decision boundary is far from rectangular or is irregular.


Ensemble learning based on decision trees (bagging, random forests, boosting) yields smooth decision boundaries and greatly improves prediction accuracy; a comparison sketch follows.
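
A minimal comparison sketch, assuming scikit-learn and its breast-cancer data: cross-validated accuracy of a single tree versus bagging, random forest, and boosting ensembles.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:13s} accuracy = {score:.3f}")
```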

Origin: blog.csdn.net/m0_67105022/article/details/123761751