Decision Tree (Classification & Regression Algorithms)

Decision Tree

Decision Tree is a classification and regression method based on a tree structure.

In a classification problem, each leaf node of the decision tree represents a class, and each internal (non-leaf) node represents a feature and a test on its value. By repeatedly asking about a sample's feature values, we move down the tree step by step until a leaf node is reached, which gives the classification result.

In a regression problem, each leaf node of the decision tree represents a predicted value, and each internal node likewise represents a feature and a test on its value. By repeatedly asking about a sample's feature values, we reach a leaf node and obtain the prediction.

Building a decision tree classifier with scikit-learn

from sklearn import tree

# A toy training set: two samples, two features, binary labels
X = [[0, 0], [1, 1]]
y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)
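
After fitting, the tree can classify new samples, and the same module also provides a regression tree. A minimal usage sketch (the sample points below are arbitrary, chosen only for illustration):

print(clf.predict([[2., 2.]]))   # predicted class for a new sample, here [1]
print(tree.export_text(clf))     # plain-text rendering of the learned tree

# The regression counterpart works the same way:
reg = tree.DecisionTreeRegressor()
reg = reg.fit([[0, 0], [2, 2]], [0.5, 2.5])
print(reg.predict([[1, 1]]))     # -> [0.5]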

There are many decision tree algorithms; the most commonly used are the ID3 algorithm and the C4.5 algorithm, both of which use information entropy as the basis for feature selection.

Information entropy measures how mixed (impure) a data set is: the smaller it is, the more orderly the data set; the larger it is, the more mixed the data set.
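
Written out as a formula (the standard definition, where p_k is the proportion of samples belonging to class k):

H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k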

The ID3 algorithm selects the feature with the largest information gain at each node, while the C4.5 algorithm selects the feature with the largest information gain ratio.
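
In formula form (standard definition, where D_v is the subset of D in which feature a takes value v):

Gain(D, a) = H(D) - \sum_{v} \frac{|D_v|}{|D|} H(D_v)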

The advantages of the decision tree algorithm are that it is easy to understand, easy to implement, and highly interpretable. The disadvantages are that it overfits easily and therefore needs pruning, and that continuous features and missing values are harder to handle.

The following are the basic steps of the decision tree algorithm:

Feature selection: Select the best features for segmentation.

Decision tree generation: Build a decision tree based on selected features.

Decision tree pruning: Prune the decision tree to prevent overfitting.

Specifically, the decision tree can be constructed according to the following steps:

Initialization: start with all samples grouped together at the root node, and divide the data set into a training set and a test set.

Feature selection: For each node, select the best feature for partitioning.

Decision tree generation: Create sub-nodes based on the selected features and continue to divide the sub-nodes.

Decision tree pruning: Prune the decision tree to prevent overfitting.

Test: Use the test set to verify the accuracy of the decision tree.

Usage: Use decision trees for classification or prediction.
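
The steps above can be sketched end to end with scikit-learn. This is only a rough illustration, not a definitive recipe: the iris data set, the 70/30 split and the ccp_alpha value are placeholders, and pruning is done here via cost-complexity pruning:

from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Divide the data set into a training set and a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Generate the tree; ccp_alpha > 0 applies cost-complexity pruning
clf = tree.DecisionTreeClassifier(ccp_alpha=0.01)
clf.fit(X_train, y_train)

# Test: accuracy of the decision tree on the held-out data
print(clf.score(X_test, y_test))

# Usage: classify new samples
print(clf.predict(X_test[:3]))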

It should be noted that classic decision tree algorithms such as ID3 expect features to take discrete values; continuous features therefore need to be discretized first. In addition, decision trees can be used for multi-class classification as well as regression problems.
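
One possible way to discretize a continuous feature is scikit-learn's KBinsDiscretizer; the values below are made up for illustration:

from sklearn.preprocessing import KBinsDiscretizer

# Invented continuous values, binned into 3 equal-width intervals
X = [[1.2], [3.5], [4.8], [7.1], [9.9]]
disc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
print(disc.fit_transform(X))   # each value replaced by its bin index (0, 1 or 2)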

The ID3 algorithm and the C4.5 algorithm are both decision tree algorithms for feature selection based on the principle of information entropy.

ID3

The ID3 algorithm is one of the earliest decision tree algorithms, proposed by Ross Quinlan in 1986. Its core idea is to select, at each node, the feature with the largest information gain for partitioning. Information gain measures a feature's ability to separate the data set: the greater the information gain, the better the feature separates the classes.

Specifically, the feature selection process of the ID3 algorithm is as follows:

Calculate the information entropy of the data set as the initial uncertainty.

Calculate the information gain of each feature and select the feature with the largest information gain as the splitting node.

Create a child node for each value of the selected feature, divide the data set into subsets according to those values, and continue building the decision tree recursively on each subset.

Stop the recursion when all samples at a node belong to the same category or no features remain.
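
As a small made-up example of the gain calculation: suppose a node holds 10 samples (5 positive, 5 negative) and a feature splits them into two subsets of 5 with class counts (4, 1) and (1, 4):

H(D) = 1.0
H(D_1) = H(D_2) = -(0.8 \log_2 0.8 + 0.2 \log_2 0.2) \approx 0.722
Gain = 1.0 - (0.5 \cdot 0.722 + 0.5 \cdot 0.722) \approx 0.278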

The main disadvantage of the ID3 algorithm is that it overfits easily: it is biased towards features with many distinct values, which makes the decision tree overly complex and causes it to fit the training set too closely.

Python implementation of the ID3 algorithm

import math

def entropy(data):
    """Information entropy of the labels (last column) of a data set."""
    entropy = 0
    total = len(data)
    counts = {}
    for item in data:
        label = item[-1]
        if label not in counts:
            counts[label] = 0
        counts[label] += 1
    for label in counts:
        prob = counts[label] / total
        entropy -= prob * math.log(prob, 2)
    return entropy

def split_data(data, column):
    """Group the rows of the data set by their value in the given column."""
    splits = {}
    for item in data:
        value = item[column]
        if value not in splits:
            splits[value] = []
        splits[value].append(item)
    return splits

def choose_best_feature(data):
    """Return the index of the feature with the largest information gain."""
    best_feature = None
    best_gain = 0
    total_entropy = entropy(data)
    for i in range(len(data[0]) - 1):  # the last column is the label
        feature_entropy = 0
        feature_splits = split_data(data, i)
        for value in feature_splits:
            prob = len(feature_splits[value]) / len(data)
            feature_entropy += prob * entropy(feature_splits[value])
        gain = total_entropy - feature_entropy
        if gain > best_gain:
            best_gain = gain
            best_feature = i
    return best_feature

def majority_label(labels):
    """Return the most frequent label in a list of labels."""
    counts = {}
    for label in labels:
        if label not in counts:
            counts[label] = 0
        counts[label] += 1
    max_count = 0
    majority = None
    for label in counts:
        if counts[label] > max_count:
            max_count = counts[label]
            majority = label
    return majority

def id3(data, features):
    """Recursively build an ID3 decision tree as nested dictionaries."""
    labels = [item[-1] for item in data]
    if len(set(labels)) == 1:      # all samples share one label
        return labels[0]
    if len(data[0]) == 1:          # only the label column is left
        return majority_label(labels)
    best_feature = choose_best_feature(data)
    if best_feature is None:       # no feature gives a positive gain
        return majority_label(labels)
    best_feature_name = features[best_feature]
    tree = {best_feature_name: {}}
    feature_splits = split_data(data, best_feature)
    for value in feature_splits:
        sub_features = features[:best_feature] + features[best_feature + 1:]
        # Drop the used feature column so indices keep matching sub_features
        sub_data = [item[:best_feature] + item[best_feature + 1:]
                    for item in feature_splits[value]]
        tree[best_feature_name][value] = id3(sub_data, sub_features)
    return tree

First, several helper functions are defined: entropy() computes the information entropy of a data set, split_data() groups the data by the values of a feature, choose_best_feature() selects the feature with the largest information gain, and majority_label() returns the most frequent label.
The main function id3() then builds the decision tree recursively.
Inside id3(), it first checks whether all samples share the same label; if so, that label is returned as a leaf.
It then checks whether any features are left for splitting; if not, it returns the most frequent label in the data.
Otherwise the best feature is selected, the data is split on its values, and a subtree is built recursively for each subset.
Finally, each subtree is attached to the dictionary of the current node, yielding the complete decision tree.
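
A quick way to try the implementation is to call id3() on a tiny hand-made data set (the rows and feature names below are invented purely for illustration):

# Each row is [outlook, windy, label]; the label is the last column
data = [
    ['sunny', 'no',  'yes'],
    ['sunny', 'yes', 'no'],
    ['rainy', 'no',  'yes'],
    ['rainy', 'yes', 'no'],
]
features = ['outlook', 'windy']
print(id3(data, features))
# -> {'windy': {'no': 'yes', 'yes': 'no'}}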

C4.5

The C4.5 algorithm is an improved version of the ID3 algorithm, proposed by Ross Quinlan in 1993. Its core idea is to select, at each node, the feature with the largest information gain ratio for partitioning. The information gain ratio also measures a feature's ability to separate the data set, but dividing the gain by the feature's intrinsic value corrects the ID3 algorithm's bias towards features with many distinct values.
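
In formula form (standard definitions; the intrinsic value IV(a) is the entropy of the split induced by feature a):

GainRatio(D, a) = \frac{Gain(D, a)}{IV(a)}, \qquad IV(a) = -\sum_{v} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}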

Specifically, the feature selection process of the C4.5 algorithm is as follows:

Calculate the information entropy of the data set as the initial uncertainty.

Calculate the information gain ratio of each feature and select the feature with the largest information gain ratio as the splitting node.

Create a child node for each value of the selected feature, divide the data set into subsets according to those values, and continue building the decision tree recursively on each subset.

Stop the recursion when all samples at a node belong to the same category or no features remain.

The advantage of the C4.5 algorithm is that it removes the bias of the ID3 algorithm and usually yields better feature selections. The disadvantage is the extra work of computing the gain ratio (the division by the intrinsic value), which makes it somewhat slower.

Python implementation of the C4.5 algorithm

import math

def entropy(data):
    """Information entropy of the labels (last column) of a data set."""
    entropy = 0
    total = len(data)
    counts = {}
    for item in data:
        label = item[-1]
        if label not in counts:
            counts[label] = 0
        counts[label] += 1
    for label in counts:
        prob = counts[label] / total
        entropy -= prob * math.log(prob, 2)
    return entropy

def split_data(data, column):
    """Group the rows of the data set by their value in the given column."""
    splits = {}
    for item in data:
        value = item[column]
        if value not in splits:
            splits[value] = []
        splits[value].append(item)
    return splits

def choose_best_feature(data):
    """Return the index of the feature with the largest information gain ratio."""
    best_feature = None
    best_gain_ratio = 0
    total_entropy = entropy(data)
    for i in range(len(data[0]) - 1):  # the last column is the label
        feature_entropy = 0
        feature_iv = 0                 # intrinsic value (split information)
        feature_splits = split_data(data, i)
        for value in feature_splits:
            prob = len(feature_splits[value]) / len(data)
            feature_entropy += prob * entropy(feature_splits[value])
            feature_iv -= prob * math.log(prob, 2)
        if feature_iv == 0:            # the feature has a single value; skip it
            continue
        gain = total_entropy - feature_entropy
        gain_ratio = gain / feature_iv
        if gain_ratio > best_gain_ratio:
            best_gain_ratio = gain_ratio
            best_feature = i
    return best_feature

def majority_label(labels):
    """Return the most frequent label in a list of labels."""
    counts = {}
    for label in labels:
        if label not in counts:
            counts[label] = 0
        counts[label] += 1
    max_count = 0
    majority = None
    for label in counts:
        if counts[label] > max_count:
            max_count = counts[label]
            majority = label
    return majority

def c45(data, features):
    """Recursively build a C4.5 decision tree as nested dictionaries."""
    labels = [item[-1] for item in data]
    if len(set(labels)) == 1:      # all samples share one label
        return labels[0]
    if len(data[0]) == 1:          # only the label column is left
        return majority_label(labels)
    best_feature = choose_best_feature(data)
    if best_feature is None:       # no feature improves the gain ratio
        return majority_label(labels)
    best_feature_name = features[best_feature]
    tree = {best_feature_name: {}}
    feature_splits = split_data(data, best_feature)
    for value in feature_splits:
        sub_features = features[:best_feature] + features[best_feature + 1:]
        # Drop the used feature column so indices keep matching sub_features
        sub_data = [item[:best_feature] + item[best_feature + 1:]
                    for item in feature_splits[value]]
        tree[best_feature_name][value] = c45(sub_data, sub_features)
    return tree

The same helper functions are defined: entropy() computes the information entropy of a data set, split_data() groups the data by the values of a feature, choose_best_feature() now selects the feature with the largest information gain ratio, and majority_label() returns the most frequent label.
The main function c45() then builds the decision tree recursively.
Inside c45(), it first checks whether all samples share the same label; if so, that label is returned as a leaf.
It then checks whether any features are left for splitting; if not, it returns the most frequent label in the data.
Otherwise the best feature is selected, the data is split on its values, and a subtree is built recursively for each subset.
Finally, each subtree is attached to the dictionary of the current node, yielding the complete decision tree.
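
To see how the gain ratio changes the choice of split, c45() can be run on a small invented data set that contains an 'id'-like column with a distinct value in every row (both the rows and the feature names are hypothetical):

# Each row is [id, outlook, label]; 'id' is unique per row
data = [
    ['1', 'sunny', 'yes'],
    ['2', 'sunny', 'yes'],
    ['3', 'sunny', 'yes'],
    ['4', 'rainy', 'no'],
    ['5', 'rainy', 'no'],
    ['6', 'rainy', 'yes'],
]
features = ['id', 'outlook']
print(c45(data, features))
# -> {'outlook': {'sunny': 'yes', 'rainy': {'id': {'4': 'no', '5': 'no', '6': 'yes'}}}}
# ID3 would split on 'id' at the root instead (gain 0.92 vs 0.46),
# because every unique value yields a pure single-row subset.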
