Decision tree algorithms: they are everywhere (Part 1 of 2)


1. Description

        This article first introduces the concept of a decision tree, then explains the principle behind it together with the supporting theory of information entropy, and gives a comprehensive description of decision trees. If you want to know all about decision trees, both theory and implementation, please read this article.

2. Introduction

        "Two roads diverged in a wood and I took the road less traveled by and that made all the difference.

        This line is from Robert Frost's poem "The Road Not Taken". The poem eloquently describes how a person's decisions can have a lasting impact on their life. Aren't decisions an integral part of our lives? The paths we walk and the decisions we make greatly affect our destiny and who we are.

        A decision tree algorithm is much like the poem: every decision creates a new branch in the tree, and, likewise, a small decision can set a different course and change our lives.

 

        The most basic understanding of a decision tree is that it is a huge structure of nested if-else statements.

        Consider a small categorical dataset with 6 rows and 3 columns (gender, occupation and a preferred application), which we can try to classify using nested if-else statements like so:

 

g = input("Enter gender(M/F): ")
print("Occupation: ")
print("1-> Student")
print("2-> Programmer")
o = int(input("Enter occupation: " ))
if g=='F':
    if o==1:
        print('Fortnite')
    elif o==2:
        print('GitHub')
elif g=='M':
    if o==1:
        print('Minecraft')
    elif o==2:
        print('VS Code')

        The first question I ask myself is: why do we need decision trees if they are just a collection of if-else statements used to solve a problem? The answer is:

  1. Hand-written if-else statements are brittle: they are easily broken by small changes in the data and have to be rewritten by hand. Decision trees, on the other hand, learn their rules from the data and can simply be retrained.
  2. if-else statements do not scale the way decision trees do: as the number of features and categories grows, the number of hand-written branches explodes, while decision tree learners handle large datasets systematically.
  3. Interpreting and debugging long chains of if-else statements can be challenging. Decision trees, on the other hand, are a more intuitive and visual way of representing decision rules.

Diagrammatically, this would look like this:

[Figure: the nested if-else logic above drawn as a tree]

2.1 The basic terms related to the tree data structure are as follows:

  • Root Node: The root node is the topmost node. It is the starting node for generating the decision tree.
  • Parent Node: A parent node is a node that has one or more child nodes.
  • Child Node: A child node is a node branched from a parent node. A parent node is a node that makes a decision, and a child node is a node that represents the possible outcomes of that decision.
  • Leaf Node: A leaf node in a decision tree is a node without any child nodes. Leaf nodes represent the terminal nodes of the decision tree, and they contain the final predictions of the model.
  • Splitting: The process of dividing a node into two or more child nodes.
  • Pruning: Pruning is the opposite of splitting; it is the process of removing the children of a decision node.
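To make these terms concrete, here is a tiny illustrative sketch of my own (not sklearn's internal representation) of the structure they describe: internal nodes hold a decision, child nodes hang off their parent, and leaf nodes hold a prediction.

# Minimal sketch of a decision-tree node structure (hypothetical, for illustration only).
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, prediction=None):
        self.feature = feature        # feature index this node splits on (None for a leaf)
        self.threshold = threshold    # threshold used for the split
        self.left = left              # child node for samples with feature <= threshold
        self.right = right            # child node for samples with feature > threshold
        self.prediction = prediction  # class label stored at a leaf node

    def is_leaf(self):
        return self.prediction is not None

# The root is the topmost (parent) node; its two leaves are its child nodes.
root = Node(feature=0, threshold=0.5,
            left=Node(prediction='no'),    # leaf node
            right=Node(prediction='yes'))  # leaf node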

2.2 What is a decision tree algorithm?

A decision tree is a supervised learning algorithm that uses a tree-like structure to make decisions based on input data. It divides data into branches and assigns results to leaf nodes.

Decision trees are used for classification and regression tasks, providing models that are efficient, accurate, and easy to understand.

Different libraries use different algorithms to build decision trees, but in this article we will discuss CART (Classification and Regression Trees).

CART is a recursive algorithm that starts at the root node of the tree and splits the data into two or more subsets based on the value of a single feature. This process is repeated until each subset is pure, meaning that all data points in the subset belong to the same class, or until a stopping condition is reached.
CART uses various impurity measures, such as entropy and Gini impurity, to decide which feature to split on at each node.

Mathematically, a decision tree slices the feature space into hyper-rectangles using hyperplanes parallel to the axes: every split is a threshold on a single feature.
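As a quick illustration of these axis-parallel splits (a sketch of my own, not part of the original article), the snippet below fits a small sklearn tree on a toy two-feature dataset and prints each internal node's split; every split is simply one feature compared against a threshold:

# Sketch: every split in a fitted CART tree is a threshold on a single feature.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 2))             # two features -> a 2-D feature space
y = (X[:, 0] > 0.5).astype(int)      # the label depends only on feature 0

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Internal nodes store (feature index, threshold); leaves are marked with -2.
for node, (feat, thr) in enumerate(zip(clf.tree_.feature, clf.tree_.threshold)):
    if feat != -2:
        print(f"node {node}: split on feature {feat} at threshold {thr:.3f}")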

2.3 Why do we need decision trees in machine learning?

  1. Ease of understanding and interpretation, which helps developers debug and understand models.
  2. Noise resistance means that decision trees can handle noisy data relatively well, making them a good choice for problems where the data is not completely clean.
  3. Decision tree training is efficient and flexible.
  4. Decision trees are widely used in credit scoring, fraud detection, medical diagnosis, customer segmentation, etc.

3. Some key concepts related to the CART algorithm are:

  1. Entropy
  2. Information gain
  3. Gini impurity/index

Entropy, information gain and Gini impurity are three basic concepts in decision tree algorithms. They are used to measure the impurity of the data and to decide which feature to split the data on. Using these concepts, decision tree algorithms build accurate and reliable models.

3.1. Entropy:

Entropy is a measure of impurity or disorder. In simple terms, it is a measure of randomness.

Entropy helps us identify the most important features for predicting the target variable. It helps to reduce the impurity of the data, thus making the model more accurate. For a two-class problem, the formula for entropy is:

E = -P(yes) · log2 P(yes) - P(no) · log2 P(no)

  • P(yes): the probability of "yes"
  • P(no): the probability of "no"
  • E: entropy

More generally, for several classes, E = -Σ p_i · log2 p_i, summed over all classes i.

Some basic observations of entropy are as follows:

  1. For a two-class problem ("yes"/"no" or "true"/"false"), the minimum entropy is 0 and the maximum entropy is 1.
  2. For more than two classes, the minimum entropy is still 0, but the maximum value can be greater than 1.
  3. A dataset with 0 impurity (all samples in one class) leaves nothing for a split to learn, whereas a dataset with half "yes" and half "no" has entropy 1 and is the kind of data a split can learn from.
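As a small illustration (my own sketch, not from the original article), the helper below computes entropy from class counts using the formula above and reproduces these three observations:

# Sketch: entropy from class counts, E = -sum_i p_i * log2(p_i)
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                  # class probabilities
    p = p[p > 0]                     # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy([5, 5]))      # half yes / half no -> 1.0 (maximum for two classes)
print(entropy([10, 0]))     # pure node -> 0.0
print(entropy([4, 3, 3]))   # three classes -> about 1.57, i.e. greater than 1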

3.2. Information gain:

Information gain helps the algorithm decide which attribute should be selected as a decision node or which feature to split the data on. It is a measure of split quality. Information gain is a measure of the entropy reduction achieved by splitting nodes on specific features.

Building a decision tree is all about finding the attribute that returns the highest information gain. A feature with high information gain is a good splitter, as it leads to purer child nodes. The formula for information gain is as follows:

IG = E(parent) - Σ w_k · E(child_k)

  • E(parent): entropy of the parent node
  • E(child_k): entropy of the k-th child node
  • w_k: the weight of the k-th child, i.e. the fraction of the parent's samples that fall into it (so Σ w_k · E(child_k) is the weighted entropy, or "information")
  • IG: information gain
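The following sketch (illustrative only) turns the formula into code: the parent's entropy minus the weighted entropies of the children, computed from class counts for each node.

# Sketch: information gain = E(parent) - sum_k w_k * E(child_k)
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(parent_counts, children_counts):
    n = sum(sum(c) for c in children_counts)                        # total samples in the parent
    weighted = sum((sum(c) / n) * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

# Parent with 9 yes / 5 no, split into three children (the Outlook split of section 4):
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ~0.247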

3.3 So, what is the relationship between entropy and information gain?

The goal of the decision tree algorithm is to separate the labels more and more as we move down the tree, in other words, to produce the purest possible distribution of labels at each node. In decision trees, entropy and information gain determine which feature is used to split at each step of tree construction. The feature with the highest information gain is chosen, as it yields the most significant reduction in entropy and the purest child nodes.

The higher the information gain, the lower the remaining entropy. Entropy is a measure of impurity: the more impure the data, the greater the uncertainty. By reducing the entropy of the data, we make it more specific and thus achieve a higher information gain.

Information gain measures how much the entropy of the data is reduced by splitting on a specific feature. The greater the reduction in entropy, the more homogeneous the child nodes and the better the split.

3.4. Gini impurity/index:

Gini impurity is a misclassification measure that also works when the data contains more than two class labels. It is similar to entropy, but it can be calculated much faster. Algorithms like CART (Classification and Regression Trees) use Gini impurity as the default impurity criterion.

The minimum value of the Gini index is 0, while its maximum value for a two-class problem is 0.5.

It is faster and less computationally expensive than entropy, because entropy involves computing logarithms; in some cases, however, the entropy criterion produces slightly better splits. The formula is:

Gini = 1 - Σ p_i²

  • p_i: the probability of class i
  • for a two-class problem this reduces to Gini = 1 - P(yes)² - P(no)²

Gini impurity and entropy as a function of the class probability; source: quantdare.com
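A small sketch (illustrative only) of the Gini formula next to entropy makes the difference in range visible: for a 50/50 two-class node Gini peaks at 0.5 while entropy peaks at 1.0, and both drop to 0 for a pure node.

# Sketch: Gini impurity = 1 - sum_i p_i^2, compared with entropy
import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([5, 5]), entropy([5, 5]))     # 0.5 vs 1.0 -> maximum impurity for a 50/50 split
print(gini([10, 0]), entropy([10, 0]))   # 0.0 vs 0.0 -> a pure node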
 

4. How does the decision tree select the root node and perform splitting?

        The five columns in the dataset are Outlook, Temperature, Humidity, Wind and Play. The Play column contains the outcome, whether or not to play, given the conditions in the four input columns.

4.1. Entropy of the entire dataset:

        The entropy of the entire dataset refers to the entropy of the target column. Here, our target column is "Play".

import pandas as pd
df = pd.read_csv('play_tennis.csv')
df = df.iloc[:, 1:]   # drop the first ('day') column, keeping the four inputs and 'play'
print(df)

 

 

Here,

E(Play) = -(9/14) · log2(9/14) - (5/14) · log2(5/14) ≈ 0.94

where 9/14 is the number of "yes" outcomes divided by the total number of outcomes in the sample space, and 5/14 is the number of "no" outcomes divided by the total.
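The same number can be checked directly in pandas. This is a sketch of my own; it assumes df has been loaded from play_tennis.csv as in the snippet above, with a lowercase 'play' column.

# Sketch: entropy of the target column computed from its value counts
import numpy as np

p = df['play'].value_counts(normalize=True)   # yes: 9/14, no: 5/14
dataset_entropy = -np.sum(p * np.log2(p))
print(round(dataset_entropy, 3))              # ~0.94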

4.2. Information gain for each column:

Since there are four input columns, "Outlook", "Temperature", "Humidity" and "Wind", we need to calculate the information gain of each column.

The information gain for the outlook column is as follows:

print(df.groupby('outlook')['play'].value_counts())
 

        Here we can see that our first column, Outlook, has three categorical values: sunny, overcast and rainy.

(a) We need to calculate the entropy of the sunny, overcast and rainy subsets:

E(sunny) = -(2/5) · log2(2/5) - (3/5) · log2(3/5) ≈ 0.971        (2 yes, 3 no)
E(overcast) = 0        (4 yes, 0 no, a pure subset)
E(rainy) = -(3/5) · log2(3/5) - (2/5) · log2(2/5) ≈ 0.971        (3 yes, 2 no)

(b) The information of the Outlook column is the weighted average of these entropies:

Information(Outlook) = (5/14) · 0.971 + (4/14) · 0 + (5/14) · 0.971 ≈ 0.693

(c) Information gain of the Outlook column:

IG(Outlook) = E(Play) - Information(Outlook) ≈ 0.94 - 0.693 ≈ 0.247

(d) Other columns: repeating the same steps gives information gains of approximately 0.029 for Temperature, 0.152 for Humidity and 0.048 for Wind.

        Select the column with the highest information gain as the root node. Since the Outlook column has the highest information gain, it is chosen as the root node.
        Once the algorithm has split the training set on that column, it applies the same logic to each subset, then to their subsets, and so on, recursively, until the nodes are pure or the maximum depth is reached.
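Putting the whole of section 4 together, the sketch below (my own, assuming the same categorical df as above, before any label encoding) computes the information gain of every input column and confirms that Outlook is chosen as the root:

# Sketch: information gain of each input column of the play_tennis dataframe
import numpy as np

def entropy_of(series):
    p = series.value_counts(normalize=True)
    return -np.sum(p * np.log2(p))

parent_entropy = entropy_of(df['play'])
gains = {}
for col in ['outlook', 'temp', 'humidity', 'wind']:
    weighted = sum((len(sub) / len(df)) * entropy_of(sub['play'])
                   for _, sub in df.groupby(col))
    gains[col] = parent_entropy - weighted

print(gains)                        # outlook should have the largest gain (~0.247)
print(max(gains, key=gains.get))    # -> 'outlook', the root node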

5. Implementing a decision tree with the sklearn library:

  • One of the key points to remember before implementing a decision tree classifier is that, even in classification problems, the sklearn CART implementation requires all columns to be numeric.
  • This is because CART uses a greedy algorithm to build decision trees, and it needs to compute the impurity of each candidate split numerically.
  • The categorical variables therefore have to be converted into numerical values; below this is done with sklearn's LabelEncoder (strictly speaking this is label encoding rather than one-hot encoding).
#imported the dataset:
import pandas as pd
df = pd.read_csv('play_tennis.csv')
df = df.iloc[:, 1:6]
print(df.head())

# label encoding: convert the categorical columns to integer codes
from sklearn.preprocessing import LabelEncoder
cols = ['outlook', 'temp', 'humidity', 'wind', 'play']
le = LabelEncoder()
def preprocess(df):
    for col in cols:
        df[col] = le.fit_transform(df[col])
    return df
preprocess(df)
print(df.head())

#selected the input and target columns:
x = df.iloc[:, 0:4].values
y = df.iloc[:, 4]

#imported Decision Tree Classifier:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion='entropy')

# trained the model:
dt.fit(x, y)

 

Label-encoded dataset (categorical columns converted to integer codes)

# visualization of decision tree:
from sklearn import tree
import matplotlib.pyplot as plt   # plot_tree draws with matplotlib, so graphviz is not required
tree.plot_tree(dt, feature_names=cols[:4], rounded=True, filled=True)
plt.show()

5.1 Visualization of decision tree:

 

decision tree

5.2. Hyperparameters of decision tree

Hyperparameters are parameters that are explicitly specified and control the training process. They play a crucial role in model optimization.

Some of the most common hyperparameters for decision tree classifiers are listed below; a short usage sketch follows the list:

  • max_depth: This hyperparameter specifies the maximum depth of the tree. A deeper tree will have more splits and be able to capture more information about the data, but it may also be more prone to overfitting.
  • min_samples_split: This hyperparameter specifies the minimum number of samples required to split a node. Lower values allow the tree to split more nodes, which may result in more complex trees.
  • min_samples_leaf: This hyperparameter specifies the minimum number of samples required in a leaf node. Lower values allow the tree to have more leaf nodes, which can capture finer detail but may also lead to overfitting.
  • criterion: This hyperparameter specifies the impurity measure used to evaluate splits. The most common criteria are "gini" and "entropy".
  • splitter: This hyperparameter specifies the strategy used to choose the split at each node. In sklearn the options are "best" and "random".
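As a usage sketch (values chosen only for illustration, not tuned), these hyperparameters are passed straight to the classifier's constructor:

# Sketch: setting the hyperparameters discussed above
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(
    criterion='gini',        # impurity measure: 'gini' or 'entropy'
    splitter='best',         # split strategy: 'best' or 'random'
    max_depth=3,             # limit tree depth to reduce overfitting
    min_samples_split=2,     # minimum samples required to split an internal node
    min_samples_leaf=1,      # minimum samples required at a leaf node
)
# dt.fit(x, y)   # x and y prepared as in section 5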

6. Disadvantages of decision trees

        Decision trees are a powerful machine learning algorithm that can be used for classification and regression tasks. However, they also have some disadvantages:

  • Overfitting: Decision trees can be prone to overfitting, meaning they work well on the training data but not on the test set.
  • Sensitivity to small changes in the data: Small changes in the data can significantly affect the structure of the decision tree, leading to overfitting or instability.
  • Interpretability: large and complex decision trees can be challenging to interpret. This can make it difficult to understand how the model works and to use it to make decisions.
  • Large decision trees are computationally expensive.

        Here I add a GitHub link to a heart attack detection model implemented using a decision tree algorithm. Mlynmoi Bora

Origin blog.csdn.net/gongdiwudu/article/details/132489434