Implementing the ID3 algorithm in Python [100011192]

1. Homework tasks

100011192 - Implementation of the ID3 algorithm in Python

Write a program that implements the ID3 algorithm and generates a decision tree for the data in the table below.

ID color size act age inflated
1 YELLOW SMALL STRETCH ADULT T
2 YELLOW SMALL STRETCH CHILD T
3 YELLOW SMALL DIP CHILD F
4 YELLOW LARGE STRETCH ADULT T
5 YELLOW LARGE DIP ADULT T
6 YELLOW LARGE DIP CHILD F
7 PURPLE SMALL STRETCH CHILD T
8 PURPLE SMALL DIP ADULT T
9 PURPLE SMALL DIP CHILD F
10 PURPLE LARGE STRETCH CHILD T

Hint: The data file format may be designed freely, e.g. encoding the color attribute as YELLOW: 0, PURPLE: 1, and so on; the program should read the training set from the specified data file.

Extension: the program is required to demonstrate the process of computing each attribute's information gain and the process of generating the decision tree.

2. Operating environment

  • Programming language: Python
  • Third-party libraries: NumPy, Matplotlib, scikit-learn
  • IDE: PyCharm
  • Operating system: Windows 10

3. Algorithm introduction

A decision tree is a tree structure (binary or non-binary). Each non-leaf node represents a test on a feature attribute, each branch represents the outcome of that test for some value range, and each leaf node stores a class label. To classify an item with a decision tree, start at the root node, test the corresponding feature attribute of the item, follow the branch matching its value, and repeat until a leaf node is reached; the class stored in that leaf is the decision result. To borrow an example from "Machine Learning" (the Watermelon Book):

Watermelons are gradually sorted into the corresponding categories by testing their attributes one by one; learning a decision tree is the process of finding the best sequence of such decisions.

Basic algorithm flow:

  • Input:
Training set D = {(x1,y1), (x2,y2), (x3,y3), (x4,y4), ..., (xm,ym)}

Attribute set A = {a1, a2, a3, a4, ..., ad}
  • Process:
function TreeGenerate(D, A)
generate node
if all samples in D belong to the same class C then
    mark node as a class-C leaf node; return
end if
if A is empty, or all samples in D take the same values on A then
    mark node as a leaf node labelled with the majority class in D; return
end if
select the optimal splitting attribute a* from A;
for each value a' of a* do
    generate a branch for node; let Dv denote the subset of samples in D that take value a' on a*;
    if Dv is empty then
        mark the branch node as a leaf node labelled with the majority class in D; return
    else
        use TreeGenerate(Dv, A\{a*}) as the branch node
    end if
end for
  • Output:

A decision tree with node as the root node

ID3 is a decision tree learning algorithm that uses information entropy and information gain as its attribute-selection measure.
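Concretely, for a dataset D with classes indexed by k and an attribute a with possible values a^1, ..., a^V, the two quantities are (standard definitions, added here for reference):

Ent(D) = -\sum_{k} p_k \log_2 p_k

Gain(D, a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\, Ent(D^v)

where p_k is the proportion of class-k samples in D and D^v is the subset of D taking value a^v on a. ID3 splits each node on the attribute with the largest Gain(D, a).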

4. Program analysis

4.1 Preparing the dataset:

Because the ID column has no predictive value, I removed the ID attribute from the assignment dataset.

fr = open(r'data/text.txt')
listWm = [inst.strip().split(',')[1:] for inst in fr.readlines()[1:]]

The ID field is stripped from every row with Python's slice syntax ([1:]); the first readlines()[1:] slice likewise skips the header line.
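For reference, the data file data/text.txt that this snippet expects would be the table from section 1 written as comma-separated lines, with a header row and the ID in the first column (an assumed layout; the original screenshot is not reproduced here):

ID,color,size,act,age,inflated
1,YELLOW,SMALL,STRETCH,ADULT,T
2,YELLOW,SMALL,STRETCH,CHILD,T
3,YELLOW,SMALL,DIP,CHILD,F
(and so on for the remaining seven rows)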

Screenshot of the dataset:

4.2 Output decision tree results

print(json.dumps(Trees, ensure_ascii=False))

JSON format:
{
    "act": {
        "DIP": {
            "age": {
                "CHILD": "F",
                "ADULT": "T"
            }
        },
        "STRETCH": "T"
    }
}

While the program runs, it shows that act has stronger discriminative power than age: the conditional entropy after splitting on act is lower than after splitting on age, so its information gain is higher.
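These numbers can be reproduced by hand from the table in section 1 (a sketch of the calculation, rounded to three decimals; the exact values printed by the program may be formatted differently):

Ent(D) = -\frac{7}{10}\log_2\frac{7}{10} - \frac{3}{10}\log_2\frac{3}{10} \approx 0.881

For act, the 5 STRETCH samples are all T (entropy 0) and the 5 DIP samples split 2 T / 3 F (entropy \approx 0.971), so Gain(D, act) \approx 0.881 - 0.5 \cdot 0 - 0.5 \cdot 0.971 \approx 0.396.

For age, the 4 ADULT samples are all T (entropy 0) and the 6 CHILD samples split 3 T / 3 F (entropy 1), so Gain(D, age) \approx 0.881 - 0.4 \cdot 0 - 0.6 \cdot 1 \approx 0.281.

Since Gain(D, act) > Gain(D, age), act is chosen at the root, which matches the tree printed above.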

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]  # class label vector
    if classList.count(classList[0]) == len(classList):  # only one class left: return it
        return classList[0]
    if len(dataSet[0]) == 1:  # all attributes have been used: return the majority class
        return majorityCnt(classList)
    bestFeat, bestGain = chooseBestFeatureToSplit(dataSet)  # index of the best splitting attribute
    bestFeatLabel = labels[bestFeat]  # label of the best splitting attribute
    print("Current best attribute: %s, information gain: %f" % (bestFeatLabel, bestGain))
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]  # an attribute that has been used is not considered again
    featValues = [example[bestFeat] for example in dataSet]
    uniqueValue = set(featValues)  # all possible values of this attribute, i.e. the branches of the node
    for value in uniqueValue:  # recursively build the subtree for each branch
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

At each split, the information gain of every candidate attribute is computed, and the chosen attribute together with its gain is printed.
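createTree relies on several helpers (chooseBestFeatureToSplit, splitDataSet, majorityCnt and an entropy function) that the post does not reproduce. A minimal sketch consistent with how createTree calls them, in the usual Machine-Learning-in-Action style (the bodies below are my reconstruction, not the author's original code):

from math import log

def calcShannonEnt(dataSet):
    # Shannon entropy of the class labels stored in the last column
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    shannonEnt = 0.0
    for count in labelCounts.values():
        prob = count / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    # Rows whose attribute `axis` equals `value`, with that attribute column removed
    return [featVec[:axis] + featVec[axis + 1:]
            for featVec in dataSet if featVec[axis] == value]

def chooseBestFeatureToSplit(dataSet):
    # Return (index, gain) of the attribute with the highest information gain
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestGain, bestFeature = 0.0, -1
    for i in range(numFeatures):
        newEntropy = 0.0
        for value in set(example[i] for example in dataSet):
            subDataSet = splitDataSet(dataSet, i, value)
            newEntropy += len(subDataSet) / len(dataSet) * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestGain:
            bestGain, bestFeature = infoGain, i
    return bestFeature, bestGain

def majorityCnt(classList):
    # Most frequent class label among the remaining samples
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    return max(classCount, key=classCount.get)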

4.3 Visualizing the decision tree:

def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)  # width of this subtree, in leaves
    depth = getTreeDepth(myTree)    # depth of this subtree (not used further here)
    firstStr = list(myTree.keys())[0]  # attribute tested at this node
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW,
              plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)  # label the incoming edge with the attribute value
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD  # move one level down
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            plotTree(secondDict[key], cntrPt, str(key))  # internal node: recurse
        else:
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff),
                     cntrPt, leafNode)  # leaf node: draw the class label
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD  # move back up after the subtree

The tree is drawn with the third-party library Matplotlib; a sketch of the helper functions that plotTree relies on follows below.
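plotTree depends on helpers (getNumLeafs, getTreeDepth, plotNode, plotMidText, the node styles and a createPlot entry point) that are not shown in the post. A minimal sketch of those helpers in the same style (again my reconstruction under stated assumptions, not the author's code):

import matplotlib.pyplot as plt

# Node and arrow styles assumed by plotNode
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrowArgs = dict(arrowstyle="<-")

def getNumLeafs(myTree):
    # Number of leaves = width of the subtree
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    for value in myTree[firstStr].values():
        numLeafs += getNumLeafs(value) if isinstance(value, dict) else 1
    return numLeafs

def getTreeDepth(myTree):
    # Depth = longest chain of nested dicts
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    for value in myTree[firstStr].values():
        thisDepth = 1 + getTreeDepth(value) if isinstance(value, dict) else 1
        maxDepth = max(maxDepth, thisDepth)
    return maxDepth

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    # Draw a node box with an arrow pointing from its parent
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType,
                            arrowprops=arrowArgs)

def plotMidText(cntrPt, parentPt, txtString):
    # Label the edge between a parent and a child with the attribute value
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString)

def createPlot(inTree):
    # Entry point: set up the canvas and start the recursive layout
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    createPlot.ax1 = plt.subplot(111, frameon=False, xticks=[], yticks=[])
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()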

5. Interface screenshots and analysis

5.1 A rough look at how each attribute divides the data (one chart per attribute):

color:

size:

act:

age:

5.2 The effect of each attribute on the class labels:

color:

The chart shows four columns for this attribute, i.e. both T and F occur under each of its values, so this attribute alone cannot separate the classes.

size:

The same holds for this attribute; it cannot fully separate the classes either.

act:

As the figure shows, all STRETCH samples are T, so this attribute value leads directly to a leaf node.

age:

The figure shows that all ADULT samples are T, so this attribute value also leads directly to a leaf node.

5.3 Console output of the program:

5.4 Decision tree visualization results:

From the output we can see that the act attribute gives the cleanest split of the results, followed by the age attribute.

According to the visualization, the following rules can be read off (a small code sketch for applying them follows below):

act = STRETCH => T
act = DIP and age = CHILD => F
act = DIP and age = ADULT => T
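These rules can also be read straight off the nested-dict tree from section 4.2; a small helper for applying the tree to a new sample (classify is my own addition, not part of the original program):

Trees = {"act": {"DIP": {"age": {"CHILD": "F", "ADULT": "T"}}, "STRETCH": "T"}}
labels = ['color', 'size', 'act', 'age']

def classify(tree, featLabels, testVec):
    # Walk the nested dict until a leaf (a plain class string) is reached
    if not isinstance(tree, dict):
        return tree
    firstStr = list(tree.keys())[0]           # attribute tested at this node
    featIndex = featLabels.index(firstStr)    # its position in the feature vector
    return classify(tree[firstStr][testVec[featIndex]], featLabels, testVec)

print(classify(Trees, labels, ['YELLOW', 'SMALL', 'DIP', 'CHILD']))      # F
print(classify(Trees, labels, ['PURPLE', 'LARGE', 'STRETCH', 'ADULT']))  # T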

6. Experience

  • First of all, through this experiment I learned how to visualize a decision tree.
  • Through this experiment I also reviewed the decision tree material I had studied before.
  • The amount of data in this experiment is small and all attributes are already discrete, so I did not analyze the overall data distribution.
  • Because the decision tree algorithm has little to do with linear correlation analysis, I did not measure the correlation between attributes.
  • Through this experiment I gained a clearer picture of the workflow of a machine learning project, which also lays a foundation for my future machine learning work on pathological images.

♻️ Resources


Size: 3.96MB
➡️ Resource download: https://download.csdn.net/download/s1t16/87547881
Note: If the current article or code violates your rights, please private message the author to delete it!

Origin: blog.csdn.net/s1t16/article/details/131323199