Use cases to help you understand decision trees and unlock insights

This article is shared from Huawei Cloud Community "[Machine Learning | Decision Tree] Harnessing the Potential of Data: Unlocking Insights with Decision Trees", author: Computer Magician.

1 Decision tree

1.1 Classification

A decision tree is a classification model based on a tree structure. It partitions the data set into many small decision units by splitting on data attributes step by step; each unit corresponds to a leaf node, where the classification decision is made. The core of building a decision tree is how to choose the optimal splitting attribute. Common decision tree algorithms include ID3, C4.5, and CART.

The input of a decision tree consists mainly of a training set and a test set. The training set contains samples with known class labels, while the test set contains samples whose classes are unknown and need to be predicted.

Specifically, the process of building a decision tree can be divided into the following steps:

  1. Select the best feature. When building a decision tree, an optimal feature must be chosen from the current sample set as the splitting attribute of the current node. Metrics such as information gain, the information gain ratio, or the Gini index are usually used to evaluate the splitting power of each feature and to pick the best one.
  2. Split into subsets. According to the selected feature, the current sample set is split into several subsets. Each subset corresponds to a child node, and the subsets of sibling nodes do not overlap.
  3. Build the tree recursively. For each child node, repeat the first two steps until all samples have been assigned to leaf nodes and each leaf node corresponds to a class.
  4. Prune the tree. Because decision trees are prone to overfitting, pruning is required. Common pruning methods include pre-pruning and post-pruning.

During classification, a test sample is matched step by step against the split used at each node until it reaches a leaf node, and the sample is assigned to the class represented by that leaf. The output of the decision tree is thus the predicted class of the test sample.

The advantages of decision trees are that they are easy to understand and interpret, can handle different types of data, and require little data preprocessing. However, decision trees are prone to overfitting, so pruning is needed when building them. Common pruning methods include pre-pruning and post-pruning.
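For instance (a minimal sketch, not from the original article), with scikit-learn the two pruning styles map onto constructor parameters: pre-pruning limits growth up front, while post-pruning trims the fully grown tree via cost-complexity pruning. The parameter values below are arbitrary examples.

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growth early by capping depth and leaf size
# (the values 3 and 5 here are arbitrary examples).
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)

# Post-pruning: grow the tree fully, then prune it back with
# cost-complexity pruning controlled by ccp_alpha.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)
```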

1.1.1 Case

Suppose we want to build a decision tree to predict whether a person will buy a certain product. We will use the following features to make predictions:

  1. Age: The age range is between 18 and 65 years old.
  2. Gender: male or female.
  3. Income: Income ranges from 0 to 100,000.

We have a training set with the following data:

(Training set table: serial number, age, gender, income, and whether the product was purchased.)

Now, we will use this data to build a decision tree model.

First, we choose a feature to be the root node. We can use metrics such as information gain or Gini impurity to select the best feature. In this example, we choose to use information gain.

Gini index and information gain are both indicators used for feature selection in decision trees, and they each have their own advantages and disadvantages.

The Gini index measures the purity (or uncertainty) of a data set and is often used for feature selection in decision tree algorithms. It is based on the idea of the Gini coefficient and can be interpreted as the probability that two samples drawn at random from the data set have different class labels.

The formula for calculating the Gini index is as follows:

Gini(D) = 1 - \sum_{i=1}^{k} p_i^2

where Gini(D) is the Gini index of the data set D, p_i is the proportion of samples in D that belong to the i-th class, and k is the number of classes.

The Gini index lies between 0 and 1: the smaller the value, the purer the data set, i.e. the more uniform the classes of its samples. When the data set D contains samples of only one class, the Gini index is 0, indicating that the data set is completely pure. When the classes in D are uniformly distributed, the Gini index reaches its maximum value of 1 - 1/k (0.5 for a binary problem), indicating that the uncertainty of the data set is highest.

In the decision tree algorithm, the Gini index measures how much the purity of the data set improves after splitting on a given feature. The Gini index is computed for each candidate feature, and the feature with the smallest (weighted) Gini index is chosen as the split, so as to minimize the uncertainty of the resulting subsets.
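As a small sketch of the formula above (not part of the original example), the Gini index of a set of labels can be computed directly from the class proportions:

```python
import numpy as np

def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_i ** 2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(["yes", "yes", "yes"]))              # 0.0: a pure set
print(gini(["yes", "no", "yes", "no"]))         # 0.5: an even binary split
print(gini(["yes", "yes", "yes", "no", "no"]))  # 1 - (0.6**2 + 0.4**2) ≈ 0.48
```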

Compute the information gain for each feature:

  • Information gain for age: 0.029
  • Information gain for gender: 0.152
  • Information gain of income: 0.048

According to the information gain, we choose gender as the root node.

Information gain is a metric used to select decision tree nodes; it measures how much the purity of the data set improves after a feature is chosen for splitting. Its calculation is based on the concept of information entropy.

Information entropy measures the disorder or uncertainty of a data set. For a binary classification problem (such as buy vs. not buy), the formula is as follows (the multi-class case is analogous: the sum runs over every class):

Entropy(S) = -p(Yes) \log_2 p(Yes) - p(No) \log_2 p(No)

where S is the data set, and p(Yes) and p(No) are the proportions of samples in S labelled "Yes" and "No", respectively. (Intuitively, the more evenly the classes are distributed, the greater the uncertainty and the higher the entropy; the more skewed the distribution, i.e. the more one class dominates, the closer the entropy is to 0. Splitting by entropy therefore tends to carve off subsets that are as pure as possible.)
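As a quick illustration of this formula (not part of the original example), the entropy of a set of purchase labels can be computed as follows; the label lists are made up.

```python
import numpy as np

def entropy(labels):
    """Information entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(["yes", "no", "yes", "no"]))   # 1.0: an even split has maximum uncertainty
print(entropy(["yes", "yes", "yes", "no"]))  # ~0.811: a skewed split has lower entropy
```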

Information gain is calculated as follows (the entropy of the parent set minus the weighted sum of the entropies of the subsets):

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)

where S is the data set, A is the feature whose information gain is being computed, S_v is the subset of samples for which feature A takes the value v, |S_v| is the number of samples in S_v, and |S| is the number of samples in S. (Each subset's entropy is weighted by its size; the split with the largest information gain, i.e. the smallest weighted entropy, is then chosen. In plain words, we pick the split whose branches are large and clearly dominated by one class.)

The greater the information gain, the better feature A separates the data set and improves its purity.

In our case, we calculate the information gain of each feature and select the feature with the largest information gain as the root node. Then we split the dataset into subsets based on the values of the root node, and compute the information gain within each subset to select the next node. This process continues until a stopping condition is met, such as all samples in a subset belonging to the same class or a predefined tree depth being reached.
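To make the node-selection step concrete, here is a small sketch (with made-up labels rather than the article's table) that computes the information gain of splitting a parent set into two subsets, following the formula above:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, subsets):
    """Gain(S, A) = Entropy(S) - sum(|Sv| / |S| * Entropy(Sv)) over the subsets Sv."""
    n = len(parent)
    weighted_child_entropy = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted_child_entropy

# Hypothetical labels at a parent node and in the two subsets produced by a binary feature.
parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]
print(information_gain(parent, [left, right]))  # 1.0: this split separates the classes perfectly
```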

To summarize, the pros and cons of the Gini index and information gain are as follows.

Advantages:

  • Gini index: The Gini index is an impurity measure that is computationally simpler and more efficient than information gain. On large data sets it is usually faster to compute. (It only needs the class proportions and their squares, with no logarithm.)
  • Information gain: Information gain measures how much a feature contributes to the classification task. It is grounded in information theory and handles multi-class problems well. It is also reported to perform better on imbalanced data sets. (Besides the class proportions, it multiplies each proportion by its logarithm, which weights the contribution of each class.)

Disadvantages:

  • Gini index: The Gini index only considers impurity and ignores how many distinct values a feature has, so it can be biased towards features with many values. When such features are present, the decision tree may favour splitting on them. (A feature with many values can carve out many small, relatively pure subsets, which makes its splits look attractive.)
  • Information gain: Information gain also has a preference for features with many values, because it tends to select features that produce many branches. This can make the decision tree overly complex and prone to overfitting the training data, which is why the tree depth should be limited. (The preference follows from the nature of information entropy: splitting into many small, nearly pure branches drives the weighted entropy towards zero.)

To sum up, the Gini index and information gain have different strengths and weaknesses in different situations. In practice, the appropriate criterion can be chosen according to the specific problem and the characteristics of the data set.
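In scikit-learn, for example, the split criterion is simply a constructor parameter, so both measures can be tried on the same data (an illustrative usage, not taken from the original article):

```python
from sklearn.tree import DecisionTreeClassifier

gini_tree = DecisionTreeClassifier(criterion="gini")        # Gini impurity (the default)
entropy_tree = DecisionTreeClassifier(criterion="entropy")  # information gain / entropy
```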

Next, we split the dataset into two subsets based on the value of gender (male or female).

For the male subset and for the female subset, the samples are as follows:

(Subset tables for the male and female groups.)

For the male subset, the purchase outcomes contain both "yes" and "no", so we need to split further. We choose age as the next node.

We split on the value of age (at most 30 years old versus older than 30):

For the subset aged 30 or younger:

| Serial number | Income | Buy |
| --- | --- | --- |
| 1 | 30,000 | no |
| 4 | 10,000 | no |
| 7 | 50,000 | no |

For the subset older than 30:

| Serial number | Income | Buy |
| --- | --- | --- |
| 5 | 60,000 | yes |

For the subset aged 30 or younger, every purchase outcome is "no", so no further split is needed.

For the subset older than 30, every purchase outcome is "yes", so no further split is needed.

For the female subset, every purchase outcome is "yes", so no further split is needed.

The final decision tree looks like this:

```
Gender = Male:
    Age <= 30: No
    Age > 30: Yes
Gender = Female: Yes
```

This is an example of a simple decision tree. Given the input features, the tree makes a prediction based on their values. Note that this is only a toy example; in practice, decision trees can have many more features and far more complex structures.
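The tree above can be read directly as nested if/else rules; the sketch below simply transcribes it (the function name and string values are illustrative):

```python
def predict_purchase(gender, age):
    """Transcription of the decision tree above."""
    if gender == "Male":
        return "No" if age <= 30 else "Yes"
    return "Yes"  # Gender = Female

print(predict_purchase("Male", 25))    # No
print(predict_purchase("Male", 45))    # Yes
print(predict_purchase("Female", 33))  # Yes
```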

First, we use the scikit-learn library to implement a decision tree:

```python
from sklearn import tree
import numpy as np

# Dataset: [age, gender, income]
X = np.array([[25, 1, 30000],
              [35, 0, 40000],
              [45, 0, 80000],
              [20, 1, 10000],
              [55, 1, 60000],
              [60, 0, 90000],
              [30, 1, 50000],
              [40, 0, 75000]])
Y = np.array([0, 0, 1, 0, 1, 1, 0, 1])

# Create the decision tree model
clf = tree.DecisionTreeClassifier()

# Train the model
clf = clf.fit(X, Y)

# Predict
print(clf.predict([[40, 0, 75000], [10, 0, 75000]]))  # output: [1 0]
```

Then, we implement a decision tree without using any machine learning library:

```python
import numpy as np

class Node:
    def __init__(self, predicted_class):
        self.predicted_class = predicted_class  # Predicted class for this node
        self.feature_index = 0                  # Index of the feature used for splitting
        self.threshold = 0                      # Split threshold
        self.left = None                        # Left subtree
        self.right = None                       # Right subtree

class DecisionTree:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth  # Maximum depth of the decision tree

    def fit(self, X, y):
        self.n_classes_ = len(set(y))       # Number of classes
        self.n_features_ = X.shape[1]       # Number of features
        self.tree_ = self._grow_tree(X, y)  # Build the decision tree

    def predict(self, X):
        return [self._predict(inputs) for inputs in X]  # Predict each input sample

    def _best_gini_split(self, X, y):
        m = y.size  # Number of samples
        if m <= 1:  # Fewer than two samples cannot be split
            return None, None
        num_parent = [np.sum(y == c) for c in range(self.n_classes_)]  # Samples per class in the parent node
        best_gini = 1.0 - sum((n / m) ** 2 for n in num_parent)  # Gini index of the parent node
        best_idx, best_thr = None, None  # Best split feature index and threshold
        for idx in range(self.n_features_):  # Iterate over every feature
            thresholds, classes = zip(*sorted(zip(X[:, idx], y)))  # Sort samples by feature value
            num_left = [0] * self.n_classes_  # Class counts in the left child (starts empty)
            num_right = num_parent.copy()     # Class counts in the right child (starts with all samples)
            for i in range(1, m):  # Iterate over every candidate split position
                c = classes[i - 1]  # Class of the sample that moves to the left child
                num_left[c] += 1    # Update left-child class counts
                num_right[c] -= 1   # Update right-child class counts
                gini_left = 1.0 - sum(
                    (num_left[x] / i) ** 2 for x in range(self.n_classes_)
                )  # Gini index of the left child
                gini_right = 1.0 - sum(
                    (num_right[x] / (m - i)) ** 2 for x in range(self.n_classes_)
                )  # Gini index of the right child
                gini = (i * gini_left + (m - i) * gini_right) / m  # Weighted average Gini index
                if thresholds[i] == thresholds[i - 1]:  # Identical feature values: no valid threshold between them
                    continue
                if gini < best_gini:  # Keep the split with the smallest weighted Gini index
                    best_gini = gini
                    best_idx = idx
                    best_thr = (thresholds[i] + thresholds[i - 1]) / 2
        return best_idx, best_thr  # Best split feature index and threshold

    def _best_gain_split(self, X, y):
        m = y.size  # Number of samples
        if m <= 1:  # Fewer than two samples cannot be split
            return None, None
        num_parent = [np.sum(y == c) for c in range(self.n_classes_)]  # Samples per class in the parent node
        best_gain = -1  # Best information gain found so far
        best_idx, best_thr = None, None  # Best split feature index and threshold
        for idx in range(self.n_features_):  # Iterate over every feature
            thresholds, classes = zip(*sorted(zip(X[:, idx], y)))  # Sort feature values and labels together
            num_left = [0] * self.n_classes_  # Class counts in the left child (starts empty)
            num_right = num_parent.copy()     # Class counts in the right child (starts with all samples)
            for i in range(1, m):  # Iterate over every candidate split position
                c = classes[i - 1]  # Class of the sample that moves to the left child
                num_left[c] += 1    # Update left-child class counts
                num_right[c] -= 1   # Update right-child class counts
                entropy_parent = -sum((num / m) * np.log2(num / m)
                                      for num in num_parent if num != 0)  # Entropy of the parent node
                entropy_left = -sum((num / i) * np.log2(num / i)
                                    for num in num_left if num != 0)      # Entropy of the left child
                entropy_right = -sum((num / (m - i)) * np.log2(num / (m - i))
                                     for num in num_right if num != 0)    # Entropy of the right child
                gain = entropy_parent - (i * entropy_left + (m - i) * entropy_right) / m  # Information gain
                if thresholds[i] == thresholds[i - 1]:  # Identical feature values: no valid threshold between them
                    continue
                if gain > best_gain:  # Keep the split with the largest information gain
                    best_gain = gain
                    best_idx = idx
                    best_thr = (thresholds[i] + thresholds[i - 1]) / 2  # Threshold is the midpoint of the two adjacent values
        return best_idx, best_thr  # Best split feature index and threshold

    def _grow_tree(self, X, y, depth=0):
        num_samples_per_class = [np.sum(y == i) for i in range(self.n_classes_)]  # Samples per class at this node
        predicted_class = np.argmax(num_samples_per_class)  # Predict the majority class of this node
        node = Node(predicted_class=predicted_class)  # Create the node
        if self.max_depth is None or depth < self.max_depth:  # Stop once the maximum depth is reached
            idx, thr = self._best_gain_split(X, y)  # Find the best split (by information gain)
            if idx is not None:  # A valid split exists
                indices_left = X[:, idx] < thr  # Samples whose value of feature idx is below the threshold go left
                X_left, y_left = X[indices_left], y[indices_left]      # Left-subtree samples
                X_right, y_right = X[~indices_left], y[~indices_left]  # Right-subtree samples
                node.feature_index = idx  # Record the split feature
                node.threshold = thr      # Record the split threshold
                node.left = self._grow_tree(X_left, y_left, depth + 1)    # Build the left subtree
                node.right = self._grow_tree(X_right, y_right, depth + 1)  # Build the right subtree
        return node  # Return the node

    def _predict(self, inputs):
        node = self.tree_  # Start at the root of the decision tree
        while node.left:   # Descend while the node has children
            if inputs[node.feature_index] < node.threshold:  # Feature value below the threshold: go left
                node = node.left
            else:
                node = node.right
        return node.predicted_class  # Return the predicted class of the leaf


# Dataset: [age, gender, income]
X = [[25, 1, 30000],
     [35, 0, 40000],
     [45, 0, 80000],
     [20, 1, 10000],
     [55, 1, 60000],
     [60, 0, 90000],
     [30, 1, 50000],
     [40, 0, 75000]]
Y = [0, 0, 1, 0, 1, 1, 0, 1]

# Create the decision tree model
clf = DecisionTree(max_depth=2)

# Train the model
clf.fit(np.array(X), np.array(Y))

# Predict
print(clf.predict([[40, 0, 75000], [10, 0, 75000]]))  # output: [1, 0]
```

Note that this decision tree implementation, written without any machine learning library, is a basic version: it may not handle every case, such as missing values or categorical features. In practical applications, we usually rely on a mature machine learning library such as scikit-learn, because it provides more functionality and better optimizations.
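For instance, scikit-learn's trees expect numeric inputs, so categorical features are usually encoded first; one common (illustrative) approach is one-hot encoding with pandas, as sketched below with made-up data.

```python
import pandas as pd

# Made-up raw data with a categorical "gender" column.
df = pd.DataFrame({"age": [25, 35], "gender": ["male", "female"], "income": [30000, 40000]})
X = pd.get_dummies(df, columns=["gender"])  # one-hot encode the categorical feature
print(X.columns.tolist())  # ['age', 'income', 'gender_female', 'gender_male']
```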

1.2 Regression

When a decision tree is used for a regression task, it is called a decision tree regression model (regression tree). Unlike a classification tree, its leaf nodes no longer represent class labels but a continuous value (or interval). The model is still based on a tree structure: the data set is divided into many small decision units by splitting on data features step by step, and each leaf node outputs a predicted value.

The following is the detailed principle of the decision tree regression model:

1. Division process

Similar to a classification tree, a decision tree regression model is built by recursive binary splitting. Starting from the root node, an optimal feature and the optimal split point of that feature are selected; the data set is then divided into two parts according to the value of the feature, and the left and right subtrees are constructed. These steps are repeated until a stopping condition is met, such as reaching the maximum depth or the number of samples in a node falling below a threshold.

2. The output value of the leaf node

When a sample reaches a leaf node, the output value of that leaf is the average (or median, etc.) of the target values of all training samples that fall into that leaf.

3. Prediction process

For a test sample, start from the root node, follow the split on each feature step by step until a leaf node is reached, and take that leaf's output value as the prediction for the test sample.

4. Pruning operation

Like classification trees, decision tree regression models are prone to overfitting problems, so pruning is required. Common pruning methods include pre-pruning and post-pruning.

5. Features

A decision tree regression model has the following characteristics:

(1) Easy to interpret: a decision tree regression model can intuitively show how strongly each feature influences the target variable.

(2) Non-parametric: it makes no assumptions about the data distribution and is applicable to many types of data.

(3) Handles multiple features: it can process multiple input features at the same time.

(4) No data normalization required: the input data does not need preprocessing such as normalization.
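As a brief sketch of the regression counterpart (not taken from the original article), scikit-learn's DecisionTreeRegressor follows the same interface as the classifier; the toy data below is made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up one-dimensional data: predict income from age.
X = np.array([[22], [25], [31], [38], [44], [52], [58], [63]])
y = np.array([28000, 30000, 45000, 52000, 61000, 70000, 68000, 66000])

reg = DecisionTreeRegressor(max_depth=2)  # a shallow tree as a simple form of pre-pruning
reg.fit(X, y)
print(reg.predict([[40]]))  # the prediction is the mean of the training samples in the matching leaf
```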

Extra


Huawei will hold the 8th HUAWEI CONNECT 2023 at the Shanghai World Expo Exhibition Hall and Shanghai World Expo Center on September 20-22, 2023. Under the theme of "accelerating industry intelligence", the conference invites thought leaders, business elites, technical experts, partners, developers, and other industry colleagues to discuss how to accelerate industry intelligence from the perspectives of business, industry, and ecosystem.

We sincerely invite you to attend on site, share the opportunities and challenges of industry intelligence, discuss its key measures, and experience the innovation and application of intelligent technology. You can:

  • In 100+ keynote speeches, summits, and forums, collide with the viewpoint of accelerating industry intelligence
  • Visit the 17,000-square-meter exhibition area to experience the innovation and application of intelligent technology in the industry at close range
  • Meet face-to-face with technical experts to learn about the latest solutions, development tools, and hands-on practice
  • Seek business opportunities with customers and partners

Thank you for your support and trust as always, and we look forward to meeting you in Shanghai.

Official website of the conference: https://www.huawei.com/cn/events/huaweiconnect

Welcome to follow the "Huawei Cloud Developer Alliance" official account to get the conference agenda, exciting activities, and cutting-edge technical content.

Click to follow and learn about Huawei Cloud's fresh technologies for the first time~
