Iris flower classification based on a decision tree

Table of contents

1. Decision tree model

2. Dataset

3. Build a decision tree

4. Case results and analysis

1) Decision region plot of the decision tree

2) Visualization of the decision-making process of the decision tree

3) Code to save the decision tree image to a local directory

4) Decision tree image

5) Result analysis 


1. Decision tree model

        Decision Tree is a common machine learning method that can be applied to both classification and regression tasks; here we focus on classification trees. A decision tree makes its decisions through a tree structure. The figure below, for example, uses a decision tree to decide whether to meet someone: you can think of a decision tree as classifying data by asking a series of questions and following the answers.

        In general, a decision tree consists of nodes and directed edges. A tree contains one root node, several internal nodes, and several leaf nodes. Each leaf node corresponds to a decision result, while every other node corresponds to an attribute test: the sample set contained in a node is divided among its child nodes according to the outcome of that node's test, and the root node contains the full set of samples. Starting from the root node, some feature of an instance is tested and the instance is routed to a child node according to the result; each path from the root node to a leaf node therefore corresponds to a sequence of decision tests.

        In classification problems, a decision tree represents the process of classifying instances based on their features. It can be viewed as a set of if-then rules, or as a conditional probability distribution defined over the feature space and the class space. Decision tree learning generally involves three steps: feature selection, decision tree generation, and decision tree pruning.
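
To make the "set of if-then rules" view concrete, here is a minimal sketch of a decision tree written out as nested rules. The thresholds are hypothetical placeholders for illustration only, not values learned by the model trained later in this article:

# A decision tree read as nested if-then rules.
# The thresholds here are hypothetical placeholders for illustration,
# not values learned by the model trained later in this article.
def classify_iris(petal_length, petal_width):
    if petal_width <= 0.8:        # root node: attribute test
        return "setosa"           # leaf node: decision result
    if petal_length <= 4.9:       # internal node: further attribute test
        return "versicolor"
    return "virginica"

print(classify_iris(1.4, 0.2))    # -> setosa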

2. Dataset

        The iris dataset ships with sklearn in Python. The data are first split into a training set and a test set at a ratio of 7:3. Since each sample has four features, dimensionality reduction (PCA) is then used to reduce the data to two dimensions. The code is as follows:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets (70% train / 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Reduce the data to two dimensions with PCA (fit on the training set only)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
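
As a quick check (not part of the original code), the variance retained by the two components can be inspected; for the iris data the first two principal components typically explain well over 90% of the total variance:

# Proportion of the original variance retained by each of the two components
print(pca.explained_variance_ratio_)
print("total retained: {:.1%}".format(pca.explained_variance_ratio_.sum()))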

3. Build a decision tree

        Decision trees can build complex decision boundaries by partitioning the feature space into rectangles. In practice, however, the deeper the tree, the more complex its decision boundary becomes and the more likely the model is to overfit. The following code uses sklearn to train a decision tree with a maximum depth of 4, using the Gini index as the impurity measure. To give a clearer visualization, the model is trained on the two PCA-reduced features obtained above. The code to build and plot the decision tree is as follows:

from sklearn.tree import DecisionTreeClassifier
import numpy as np
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions

# Train a depth-limited decision tree on the two PCA components
tree_model = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=1)
tree_model.fit(X_train_pca, y_train)

# Stack training and test data so both splits can be plotted together
x_combined = np.vstack((X_train_pca, X_test_pca))
y_combined = np.hstack((y_train, y_test))

# Left panel: decision regions over the combined training and test data
plt.subplot(121)
plot_decision_regions(x_combined, y_combined, clf=tree_model)
plt.xlabel("petal length(cm)")
plt.ylabel("petal width(cm)")
plt.legend(loc='upper center')
plt.title('Train & Test Data')

# Right panel: decision regions over the test data only
plt.subplot(122)
plot_decision_regions(X_test_pca, y_test, clf=tree_model)
plt.xlabel("petal length(cm)")
plt.ylabel("petal width(cm)")
plt.legend(loc='upper center')
plt.title('Test Data')

plt.tight_layout()
plt.show()

Note: If the mlxtend module is not installed, install it with the following command:

!pip install --upgrade mlxtend

Upgrading to the latest version avoids parameter-name errors caused by differences between mlxtend versions.
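
Before moving to the results, a short snippet (not in the original post) can quantify the fit of the trained tree on both splits using the model's score method:

# Accuracy on the training and test splits (a quick numeric check of the fit)
print("train accuracy: {:.3f}".format(tree_model.score(X_train_pca, y_train)))
print("test accuracy:  {:.3f}".format(tree_model.score(X_test_pca, y_test)))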

4. Case results and analysis

1) Decision region plot of the decision tree

         After the code is executed, the decision regions of the resulting decision tree are shown in the figure above; the decision boundaries are parallel to the coordinate axes.

        Note: The horizontal and vertical coordinates in this plot are the first two principal components obtained by PCA. The original dataset has 4 features, but after PCA only the first two principal components are retained, so the two axes represent the values of these components. The benefit is that high-dimensional data can be visualized as points on a two-dimensional plane, with class information conveyed by color and marker shape. This also explains why the coordinate values can be negative; for convenience, the analysis below refers to these coordinate values directly.
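
To make these coordinate values concrete, a point in the PCA plane can be mapped back to approximate values of the four original features. This is a sketch added here for illustration, not part of the original post:

# Map the first PCA-transformed training sample back to approximate
# values of the four original iris features
approx = pca.inverse_transform(X_train_pca[:1])
print(dict(zip(iris.feature_names, approx[0].round(2))))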

2) Visualization of the decision-making process of the decision tree

        After the model is trained, the sklearn library can export the decision tree in .dot format (the code below does not write a .dot file; to export one, change out_file=None to the desired file name). The Graphviz program can then render the .dot data to visualize the tree's decision-making process.

        Graphviz can be downloaded from its official website (https://graphviz.org/); it supports Linux, Windows, Solaris, and FreeBSD. In addition to Graphviz, you also need a Python library called pydotplus, which interfaces with Graphviz and converts .dot data into an image of the decision tree.


Code to install pydotplus:

pip install pydotplus

3) Code to save the decision tree image to a local directory

from sklearn.tree import export_graphviz
from pydotplus import graph_from_dot_data

# Export the decision tree as dot data.
# out_file specifies the export file name and path; feature_names names the
# features (if any); filled and rounded style the rendered tree.
dot_data = export_graphviz(tree_model, out_file=None,
                           feature_names=['petal length', 'petal width'],
                           class_names=['Setosa', 'Versicolor', 'Virginica'],
                           filled=True, rounded=True, special_characters=True)

# Convert the .dot data into a PNG image

graph = graph_from_dot_data(dot_data)
graph.write_png('tree.png')

        Setting out_file=None makes export_graphviz return the dot data as a string, which is assigned to dot_data. The parameters filled, rounded, class_names, and feature_names are optional: color fills and rounded frames improve the appearance of the image, while class_names and feature_names display the majority class label and the splitting feature at each node.
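
As a side note, if installing Graphviz is inconvenient, recent versions of sklearn also provide a pure-matplotlib renderer, sklearn.tree.plot_tree, which draws the same tree without any external programs; a minimal sketch:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Pure-matplotlib alternative to Graphviz/pydotplus (requires sklearn >= 0.21)
fig, ax = plt.subplots(figsize=(12, 8))
plot_tree(tree_model, feature_names=['petal length', 'petal width'],
          class_names=['Setosa', 'Versicolor', 'Virginica'],
          filled=True, rounded=True, ax=ax)
plt.savefig('tree_plt.png')
plt.show()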

4) Decision tree image

5) Result analysis 

        Visualizing the decision-making process yields the decision tree shown in the figure above, on which the node-splitting process of the training set can be traced. Starting from the root node, which contains all 105 training samples, the split on petal width <= -1.683 (see the note in section 1) for why the value is negative) divides the samples into two child nodes of 34 and 71 samples. As the figure shows, the left child node is already highly pure: it contains only Iris-setosa samples (Gini index = 0) and therefore becomes a leaf node. The right child node has a Gini index of 0.495 and needs to be split further to separate the two remaining classes, Iris-versicolor and Iris-virginica.
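
The Gini index quoted above follows directly from the class proportions at a node: Gini = 1 - sum_i p_i^2. A small check is sketched below; the 32/39 class split assumed for the right child is an illustration chosen because it reproduces the quoted 0.495, and the exact counts depend on the random train/test split:

def gini(counts):
    # Gini index of a node from its per-class sample counts: 1 - sum(p_i^2)
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([34, 0, 0]))   # pure left child (all Iris-setosa) -> 0.0
print(gini([0, 32, 39]))  # assumed right-child counts -> ~0.495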

        The decision region plot shows that the decision tree classifies the iris flowers well. Although sklearn does not provide a way to prune a decision tree by hand, it is enough to change the max_depth parameter to 3 in the tree-building code above to limit the tree's depth in advance (pre-pruning), as shown in the figure below.
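
A minimal sketch of that change, retraining the same model with max_depth=3 as a form of pre-pruning:

# Pre-pruning by limiting depth: same settings as before except max_depth
pruned_model = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=1)
pruned_model.fit(X_train_pca, y_train)
print("pruned test accuracy: {:.3f}".format(pruned_model.score(X_test_pca, y_test)))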

 


Origin blog.csdn.net/weixin_51756038/article/details/130119380