Matlab implementation of the ID3 decision tree

Two roads diverged in a wood, and I took the one less traveled by, and that has made all the difference.
--- Robert Frost
Contents
1. Decision Tree Introduction
1.1 Related Concepts
1.2 Graphical Representation
1.3 Rule Representation
2. Decision Tree Information Calculation
3. ID3 Related Introduction
3.1 ID3 Algorithm Overview
3.2 Algorithm Process
4. Matlab Implementation
4.1 Dataset
4.2 Code Implementation

1. Introduction to decision trees

1.1 Related concepts

What is a decision tree? As the name suggests, a decision tree is used to make decisions.

How does it make a decision? It represents the decision logic in the form of a tree. Because it is a tree, it has branches, and each fork corresponds to a split that tells us which branch to follow next. A decision tree consists of the following parts:

(1) Root node: the starting point of the decision tree.

(2) Branch nodes: internal nodes at which a feature (attribute) is selected to split the samples.

(3) Internal nodes: the root node and the branch nodes taken together; most of the samples pass through them as the tree is grown.

(4) Leaf nodes: the end points of the decision tree; each leaf determines the class label assigned to the samples that reach it.

1.2 Graphical representation

A graphical representation of a decision tree is as follows:

1.3 Rule Representation

The rules of the decision tree are expressed as:
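Each path from the root node to a leaf node can be read as an if-then rule; in general the rules take the form

IF (attribute 1 = value a) AND (attribute 2 = value b) AND ... THEN class = c

with one rule per leaf of the tree.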

2. Information calculation for decision trees

Why do we compute information measures? To choose the splitting attribute of the decision tree. Before making a split, we compute an information measure for each candidate feature and select the feature with the largest information gain as the splitting attribute.

How is the information computed? First, introduce the following concepts:

(1) Entropy: entropy measures the disorder (randomness) of a set of samples; when splitting, we look for the attribute that reduces this disorder the most, i.e. the one with the largest information gain. For example, when tossing a fair coin, both sides appear with probability 1/2, so the entropy is 1 bit; if someone tells us the coin is certain to come up heads, the randomness vanishes and the entropy is 0. The calculation formula is as follows:
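For a sample set S containing K classes, where p_k is the proportion of samples of class k in S,

Ent(S) = -\sum_{k=1}^{K} p_k \log_2 p_k

with the convention that 0 * log2(0) = 0 (used in the code below whenever a class is absent from a subset).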

The more orderly a system is, the lower its information entropy; conversely, the more chaotic a system is, the higher its information entropy. Information entropy can therefore be regarded as a measure of how ordered a system is.

(2) Information gain: the amount of information gained by a split. In the coin example above, the information gain is 1 - 0 = 1. The calculation formula is as follows:
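If attribute a splits the sample set S into subsets S^1, ..., S^V (one subset per value of a), then

Gain(S, a) = Ent(S) - \sum_{v=1}^{V} (|S^v| / |S|) Ent(S^v)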

Information gain is defined with respect to a specific attribute. Choose an attribute to partition the data set D, compute the entropy of each resulting subset, and take the weighted average of these entropies. Subtracting this weighted average from the entropy of D before the split gives the information gain obtained by splitting the sample set D on that attribute.
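The two formulas above translate almost directly into Matlab. Below is a minimal sketch (the helper names entropy_of and info_gain are ours, not part of the script in section 4); labels is assumed to be a cell array of class labels and col a cell array holding one attribute's values:

function h = entropy_of(labels)
    % Ent(S) = -sum_k p_k * log2(p_k); classes that do not occur never enter p
    [~, ~, idx] = unique(labels);
    p = accumarray(idx, 1) / numel(idx);     % class proportions p_k
    h = -sum(p .* log2(p));
end

function g = info_gain(col, labels)
    % Gain(S, a) = Ent(S) - sum_v |S_v|/|S| * Ent(S_v)
    g = entropy_of(labels);
    values = unique(col);
    for v = 1:numel(values)
        mask = strcmp(col, values{v});       % samples whose attribute value is v
        g = g - sum(mask) / numel(labels) * entropy_of(labels(mask));
    end
end

For the coin example, entropy_of({'heads';'tails'}) returns 1, and once the outcome is certain the entropy drops to 0, giving the gain of 1 mentioned above.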

3. Introduction to ID3

3.1 Overview of ID3 Algorithm

ID3 (Iterative Dichotomiser 3) is a greedy algorithm used to construct a decision tree. It originated from the Concept Learning System (CLS). At each node it selects, among the attributes that have not yet been used, the one with the highest information gain as the splitting criterion, and repeats this process until the decision tree classifies the training samples perfectly.

3.2 Algorithm process

Input: sample set S, attribute set A

Output: ID3 decision tree.

(1) If all attributes have been processed, return; otherwise, go to (2).

(2) Compute the attribute a with the maximum information gain and use it as the current node. If the samples can be fully classified by attribute a alone, return; otherwise, go to (3).

(3) For each possible value v of attribute a, perform the following operations (a Matlab sketch of this recursion is given after the list):

i. Take all samples whose value of attribute a is v as a subset Sv of S;

ii. Generate attribute set AT=A-{a};

iii. Using the sample set Sv and the attribute set AT as input, recursively execute the ID3 algorithm;
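Putting these steps together, the recursion can be sketched in Matlab roughly as follows. This is an illustrative sketch rather than the script of section 4: the function and field names are ours, samples are assumed to be stored as a cell array of categorical attribute values, and info_gain is the helper sketched in section 2.

function node = id3(S, labels, A, attrNames)
    % S         : n-by-m cell array of categorical attribute values
    % labels    : n-by-1 cell array of class labels
    % A         : indices of the attributes not yet used for splitting
    % attrNames : 1-by-m cell array of attribute names
    node = struct('isLeaf', false, 'label', '', 'attr', '', 'children', {{}});
    if numel(unique(labels)) == 1 || isempty(A)
        % step (1): all samples share one class, or no attribute is left -> leaf
        node.isLeaf = true;
        node.label  = majority_label(labels);
        return
    end
    % step (2): choose the unused attribute with the largest information gain
    gains = arrayfun(@(a) info_gain(S(:, a), labels), A);
    [~, best] = max(gains);
    a = A(best);
    node.attr = attrNames{a};
    % step (3): one child per value v of attribute a, built recursively
    values = unique(S(:, a));
    for v = 1:numel(values)
        mask = strcmp(S(:, a), values{v});   % i.  subset Sv of S
        AT = setdiff(A, a);                  % ii. attribute set AT = A - {a}
        node.children{end+1} = {values{v}, id3(S(mask, :), labels(mask), AT, attrNames)}; % iii. recurse
    end
end

function m = majority_label(labels)
    % majority class among the labels (used when the attributes run out)
    [classes, ~, idx] = unique(labels);
    counts = accumarray(idx, 1);
    [~, k] = max(counts);
    m = classes{k};
end

A call such as id3(attrs, labels, 1:size(attrs, 2), attrNames), with attrs an n-by-m cell array of attribute values, would then grow the whole tree.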

4. Matlab implementation

4.1 Dataset

The data set is the mushroom data from the UCI repository; it looks as follows:

Dataset download address: https://archive.ics.uci.edu/ml/datasets/Mushroom

mushroom data
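For reference, the raw data file (agaricus-lepiota.data on the UCI page) could be loaded into Matlab roughly as follows; the variable names are ours, and the script in section 4.2 does not use them:

% Read the comma-separated UCI mushroom file (no header row).
data = readtable('agaricus-lepiota.data', 'FileType', 'text', ...
    'Delimiter', ',', 'ReadVariableNames', false);
% The first column is the class label ('e' = edible, 'p' = poisonous);
% the remaining 22 columns are the categorical mushroom attributes.
labels = data{:, 1};        % n-by-1 cell array of class labels
attrs  = data{:, 2:end};    % n-by-22 cell array of attribute values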

4.2 Code implementation

% Entropy of the whole sample set
total=20;yes=12;no=8;
Entropy_S=-(yes/total)*log2(yes/total)-(no/total)*log2(no/total);
%disp(['Entropy of sample set S: ',num2str(Entropy_S)])
% Mushroom surface condition
total_s=8;yes=6;no=2;   % surface: smooth
Entropy_s=-(yes/total_s)*log2(yes/total_s)-(no/total_s)*log2(no/total_s);
%disp(['Entropy for smooth surface: ',num2str(Entropy_s)])
total_y=6;yes=2;no=4;   % surface: scaly
Entropy_y=-(yes/total_y)*log2(yes/total_y)-(no/total_y)*log2(no/total_y);
%disp(['Entropy for scaly surface: ',num2str(Entropy_y)])
total_g=2;yes=0;no=2;   % surface: grooves
Entropy_g=-0-(no/total_g)*log2(no/total_g);           % the yes-term is 0 because yes=0
%disp(['Entropy for grooved surface: ',num2str(Entropy_g)])
total_f=4;yes=4;no=0;   % surface: fibrous
Entropy_f=-(yes/total_f)*log2(yes/total_f)-0;         % the no-term is 0 because no=0
%disp(['Entropy for fibrous surface: ',num2str(Entropy_f)])
Gain_surface=Entropy_S-(total_s/total)*Entropy_s-(total_y/total)*Entropy_y-(total_g/total)*Entropy_g-(total_f/total)*Entropy_f;
disp(['Information gain of mushroom surface condition: ',num2str(Gain_surface)])
% Gill spacing
total_d=6;yes=4;no=2;   % gill spacing: distant
Entropy_d=-(yes/total_d)*log2(yes/total_d)-(no/total_d)*log2(no/total_d);
total_c=14;yes=8;no=6;  % gill spacing: close
Entropy_c=-(yes/total_c)*log2(yes/total_c)-(no/total_c)*log2(no/total_c);
Gain_gillspacing=Entropy_S-(total_d/total)*Entropy_d-(total_c/total)*Entropy_c;
disp(['Information gain of gill spacing: ',num2str(Gain_gillspacing)])
% Gill size
total_b=10;yes=10;no=0; % gill size: broad
Entropy_b=-(yes/total_b)*log2(yes/total_b)-0;         % the no-term is 0 because no=0
total_n=10;yes=2;no=8;  % gill size: narrow
Entropy_n=-(yes/total_n)*log2(yes/total_n)-(no/total_n)*log2(no/total_n);
Gain_gillsize=Entropy_S-(total_b/total)*Entropy_b-(total_n/total)*Entropy_n;
disp(['Information gain of gill size: ',num2str(Gain_gillsize)])
% Stalk shape
total_t=4;yes=4;no=0;   % stalk shape: tapering
Entropy_t=-(yes/total_t)*log2(yes/total_t)-0;         % the no-term is 0 because no=0
total_e=16;yes=8;no=8;  % stalk shape: enlarging
Entropy_e=-(yes/total_e)*log2(yes/total_e)-(no/total_e)*log2(no/total_e);
Gain_stalkshape=Entropy_S-(total_t/total)*Entropy_t-(total_e/total)*Entropy_e;
disp(['Information gain of stalk shape: ',num2str(Gain_stalkshape)])

disp('------- Selecting the second branch attribute -------')
%% Selecting the second branch attribute
% In the first split, every mushroom with broad gill size is edible,
% so we continue splitting the subset whose gill size is narrow.
% Entropy of the subset with narrow gill size
total=10;yes=2;no=8;
Entropy_S=-(yes/total)*log2(yes/total)-(no/total)*log2(no/total);
disp(['Entropy of the narrow-gill-size subset: ',num2str(Entropy_S)])
% Mushroom surface condition
total_s=2;yes=0;no=2;   % surface: smooth
Entropy_s=0-(no/total_s)*log2(no/total_s);            % the yes-term is 0 because yes=0
total_y=4;yes=0;no=4;   % surface: scaly
Entropy_y=0-(no/total_y)*log2(no/total_y);            % the yes-term is 0 because yes=0
total_g=2;yes=0;no=2;   % surface: grooves
Entropy_g=0-(no/total_g)*log2(no/total_g);            % the yes-term is 0 because yes=0
total_f=2;yes=2;no=0;   % surface: fibrous
Entropy_f=-(yes/total_f)*log2(yes/total_f)-0;         % the no-term is 0 because no=0
Gain_surface=Entropy_S-(total_s/total)*Entropy_s-(total_y/total)*Entropy_y-(total_g/total)*Entropy_g-(total_f/total)*Entropy_f;
disp(['Information gain of mushroom surface condition: ',num2str(Gain_surface)])

% Gill spacing
total_d=3;yes=1;no=2;   % gill spacing: distant
Entropy_d=-(yes/total_d)*log2(yes/total_d)-(no/total_d)*log2(no/total_d);
total_c=7;yes=1;no=6;   % gill spacing: close
Entropy_c=-(yes/total_c)*log2(yes/total_c)-(no/total_c)*log2(no/total_c);
Gain_gillspacing=Entropy_S-(total_d/total)*Entropy_d-(total_c/total)*Entropy_c;
disp(['Information gain of gill spacing: ',num2str(Gain_gillspacing)])

% Stalk shape
total_t=1;yes=1;no=0;   % stalk shape: tapering
Entropy_t=-(yes/total_t)*log2(yes/total_t)-0;         % the no-term is 0 because no=0
total_e=9;yes=1;no=8;   % stalk shape: enlarging
Entropy_e=-(yes/total_e)*log2(yes/total_e)-(no/total_e)*log2(no/total_e);
Gain_stalkshape=Entropy_S-(total_t/total)*Entropy_t-(total_e/total)*Entropy_e;
disp(['Information gain of stalk shape: ',num2str(Gain_stalkshape)])

The result is as follows:

The resulting decision tree is as follows:
