【Machine Learning】Contact lens selection based on decision tree

Experiment introduction

1. Experimental content

In this experiment you will learn and implement the decision tree algorithm.

2. Experimental objectives

Through this experiment, you will master the basic principles of the decision tree algorithm.

3. Experimental knowledge points

  • Shannon entropy
  • information gain

4. Experimental environment

  • python 3.6.5

5. Preliminary knowledge

  • Basics of Python programming

Preparation

Click the experiment data download module at the top right of the screen and download decision_tree_glass.tgz to the specified directory. Then click File->Open->Upload at the top, upload the data set archive you just downloaded, and decompress it with the following command:

!tar -zxvf decision_tree_glass.tgz
decision_tree_glass/
decision_tree_glass/lenses.txt
decision_tree_glass/classifierStorage.txt

Decision tree construction---ID3 algorithm

  The core of the ID3 algorithm is to select, at each node of the decision tree, the feature given by the information gain criterion, and to construct the tree recursively. The specific method is: starting from the root node, calculate the information gain of every candidate feature for the node, select the feature with the largest information gain as the feature of that node, and create a child node for each value of that feature; then call the same method recursively on the child nodes to build the tree, until the information gain of every feature is small or no feature is left to choose. The result is a decision tree. ID3 is equivalent to selecting a probabilistic model by maximum likelihood.
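  For reference (these are the standard definitions behind the two knowledge points listed above, added here rather than taken from the original figures), the quantities ID3 relies on are the empirical entropy, the empirical conditional entropy and the information gain:

$$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|}\log_2\frac{|C_k|}{|D|},\qquad H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|}\,H(D_i),\qquad g(D, A) = H(D) - H(D \mid A)$$

where the $C_k$ are the classes in $D$ and the $D_i$ are the subsets of $D$ obtained by splitting on the values of feature $A$; at every node the feature with the largest $g(D, A)$ is selected.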
  Using the results obtained from the decision tree experiment: since the information gain of feature A3 (owns a house) is the largest, A3 is selected as the feature of the root node. It divides the training set D into two subsets, D1 (A3 = "yes") and D2 (A3 = "no"). Since D1 contains only sample points of the same class, it becomes a leaf node whose class is marked "yes". For D2, a new feature must be selected from A1 (age), A2 (has a job) and A4 (credit status) by calculating the information gain of each feature:
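  The original figure with this calculation is not reproduced here; its results can be read off from the second group of gain values in the program output later in this section:

$$g(D_2, A_1) \approx 0.252,\qquad g(D_2, A_2) \approx 0.918,\qquad g(D_2, A_4) \approx 0.474$$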


  According to the calculation, feature A2 (has a job), which has the largest information gain, is selected as the feature of this node. Since A2 has two possible values, two child nodes are derived from it: the child node corresponding to "yes" (has a job) contains 3 samples that all belong to the same class, so it is a leaf node with class "yes"; the child node corresponding to "no" (no job) contains 6 samples that also all belong to the same class, so it is likewise a leaf node, with class "no".
  In this way a decision tree is generated that uses only two features (it has two internal nodes); the generated decision tree is shown in the following figure:

[Exercise] Decision tree construction --- write code to build decision tree

We use a dictionary to store the structure of the decision tree. For example, the decision tree we analyzed in the previous section can be expressed as:
  {'has its own house': {0: {'has a job': {0: 'no', 1: 'yes'}}, 1: 'yes'}}
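A minimal sketch of how such a nested dictionary is read (the keys here follow the English example above; the code below produces the same structure with the Chinese feature labels from createDataSet):

tree = {'has its own house': {0: {'has a job': {0: 'no', 1: 'yes'}}, 1: 'yes'}}
print(tree['has its own house'][1])                   # 'yes'  (owns a house -> grant the loan)
print(tree['has its own house'][0]['has a job'][0])   # 'no'   (no house, no job -> deny the loan)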
Create the function majorityCnt to return the element (class label) that appears most often in classList, and the function createTree to build the decision tree recursively. Write the code as follows:

 

# -*- coding: UTF-8 -*-
from math import log
import operator
"""
函数说明:计算给定数据集的经验熵(香农熵)
Parameters:
    dataSet - 数据集
Returns:
    shannonEnt - 经验熵(香农熵)
"""
def calcShannonEnt(dataSet):
    ### Start Code Here ###                      
    #返回数据集的行数
    numEntires = len(dataSet)                        
    #保存每个标签(Label)出现次数的字典
    labelCounts = {}                       
    #对每组特征向量进行统计
    for featVec in dataSet:               
        #提取标签(Label)信息
        currentLabel = featVec[-1]
        #如果标签(Label)没有放入统计次数的字典,添加进去
        if currentLabel not in labelCounts.keys():    #如果标签(Label)没有放入统计次数的字典,添加进去
            labelCounts[currentLabel] = 0
          
                
    #Label计数
        labelCounts[currentLabel] += 1                          
    #经验熵(香农熵)
    shannonEnt = 0.0                     
    #计算香农熵
    for key in labelCounts:
        #选择该标签(Label)的概率
        prob = float(labelCounts[key]) / numEntires 
        #利用公式计算
        shannonEnt -= prob * log(prob, 2)                              
    #返回经验熵(香农熵)
    return shannonEnt
    ### End Code Here ###
    

"""
函数说明:创建测试数据集
Parameters:
    无
Returns:
    dataSet - 数据集
    labels - 特征标签
"""
def createDataSet():
    dataSet = [[0, 0, 0, 0, 'no'],                        #数据集
            [0, 0, 0, 1, 'no'],
            [0, 1, 0, 1, 'yes'],
            [0, 1, 1, 0, 'yes'],
            [0, 0, 0, 0, 'no'],
            [1, 0, 0, 0, 'no'],
            [1, 0, 0, 1, 'no'],
            [1, 1, 1, 1, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [2, 0, 1, 2, 'yes'],
            [2, 0, 1, 1, 'yes'],
            [2, 1, 0, 1, 'yes'],
            [2, 1, 0, 2, 'yes'],
            [2, 0, 0, 0, 'no']]
    labels = ['年龄', '有工作', '有自己的房子', '信贷情况']        #特征标签
    return dataSet, labels                             #返回数据集和分类属性
"""
函数说明:按照给定特征划分数据集
Parameters:
    dataSet - 待划分的数据集
    axis - 划分数据集的特征
    value - 需要返回的特征的值
Returns:
    无
"""
def splitDataSet(dataSet, axis, value):       
    retDataSet = []                                        #创建返回的数据集列表
    for featVec in dataSet:                             #遍历数据集
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]                #去掉axis特征
            reducedFeatVec.extend(featVec[axis+1:])     #将符合条件的添加到返回的数据集
            retDataSet.append(reducedFeatVec)
    return retDataSet                                      #返回划分后的数据集
"""
函数说明:选择最优特征
Parameters:
    dataSet - 数据集
Returns:
    bestFeature - 信息增益最大的(最优)特征的索引值
"""
def chooseBestFeatureToSplit(dataSet):
    ### Start Code Here ###
    #特征数量
    numFeatures = len(dataSet[0]) - 1
    #计算数据集的香农熵
    baseEntropy = calcShannonEnt(dataSet)
    #信息增益
    bestInfoGain = 0.0 
    #最优特征的索引值
    bestFeature = -1
    #遍历所有特征
    for i in range(numFeatures):
        #获取dataSet的第i个所有特征
        featList = [example[i] for example in dataSet]
        #创建set集合{},元素不可重复
        uniqueVals = set(featList)
        #经验条件熵
        newEntropy = 0.0
        #计算信息增益
        for value in uniqueVals:
            #subDataSet划分后的子集
            subDataSet = splitDataSet(dataSet, i, value)
            #计算子集的概率
            prob = len(subDataSet) / float(len(dataSet))
            #根据公式计算经验条件熵
            newEntropy += prob * calcShannonEnt(subDataSet)
        #信息增益
        infoGain = baseEntropy - newEntropy
        #打印每个特征的信息增益
        print("第%d个特征的增益为%.3f" % (i, infoGain))
        #计算信息增益
        if (infoGain > bestInfoGain):
            #更新信息增益,找到最大的信息增益
            bestInfoGain = infoGain
            #记录信息增益最大的特征的索引值
            bestFeature = i 
    #返回信息增益最大的特征的索引值
    return bestFeature
    ### End Code Here ###
"""
函数说明:统计classList中出现此处最多的元素(类标签)
Parameters:
    classList - 类标签列表
Returns:
    sortedClassCount[0][0] - 出现此处最多的元素(类标签)
"""
def majorityCnt(classList):
    classCount = {}
    for vote in classList:                                        #统计classList中每个元素出现的次数
        if vote not in classCount.keys():classCount[vote] = 0   
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key = operator.itemgetter(1), reverse = True)        #根据字典的值降序排序
    return sortedClassCount[0][0]                                #返回classList中出现次数最多的元素
"""
函数说明:创建决策树
Parameters:
    dataSet - 训练数据集
    labels - 分类属性标签
    featLabels - 存储选择的最优特征标签
Returns:
    myTree - 决策树
"""
def createTree(dataSet, labels, featLabels):
    #取分类标签(是否放贷:yes or no)
    classList = [example[-1] for example in dataSet] 
    #如果类别完全相同则停止继续划分
    if classList.count(classList[0]) == len(classList):            
        return classList[0]
    #遍历完所有特征时返回出现次数最多的类标签
    if len(dataSet[0]) == 1:                                    
        return majorityCnt(classList)
    #选择最优特征
    bestFeat = chooseBestFeatureToSplit(dataSet) 
    #最优特征的标签
    bestFeatLabel = labels[bestFeat]                            
    featLabels.append(bestFeatLabel)
    #根据最优特征的标签生成树
    myTree = {bestFeatLabel:{}}   
    #删除已经使用特征标签
    del(labels[bestFeat])                    
    #得到训练集中所有最优特征的属性值
    featValues = [example[bestFeat] for example in dataSet] 
    #去掉重复的属性值
    uniqueVals = set(featValues) 
    #遍历特征,创建决策树。
    for value in uniqueVals:                                                           
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), labels, featLabels)
    return myTree


if __name__ == '__main__':
    dataSet, labels = createDataSet()
    featLabels = []
    myTree = createTree(dataSet, labels, featLabels)
    print(myTree)
第0个特征的增益为0.083
第1个特征的增益为0.324
第2个特征的增益为0.420
第3个特征的增益为0.363
第0个特征的增益为0.252
第1个特征的增益为0.918
第2个特征的增益为0.474
{'有自己的房子': {0: {'有工作': {0: 'no', 1: 'yes'}}, 1: 'yes'}}

When recursively creating the decision tree, the recursion has two termination conditions: the first is that all class labels in the current subset are exactly the same, in which case that class label is returned directly; the second is that all features have been used and the data still cannot be divided into groups that each contain only one class, i.e. decision tree construction fails because the features are insufficient (the dimensionality of the data is not enough). Since under the second stop condition a unique class label cannot simply be returned, the class with the largest number of occurrences is returned instead.
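For instance, with the majorityCnt function defined above, a class list that cannot be split further simply falls back to its majority class (a small illustrative call, not part of the original listing):

print(majorityCnt(['yes', 'yes', 'no']))   # -> 'yes'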

[Exercise] Classification using decision trees

  After constructing a decision tree from the training data, we can use it to classify actual data. To classify a sample we need the decision tree together with the label vector used to construct it. The program then compares the test data with the values on the decision tree and recurses until it reaches a leaf node; finally, the test data is assigned the class of that leaf node. In the code that builds the decision tree you can see the featLabels parameter, which records each split node, so when using the decision tree for prediction we only need to supply the attribute values of those nodes in order. For example, to classify with the decision tree trained in the previous section, you only need to provide whether the person owns a house and whether he has a job; no other information is required.
  The code for classifying with a decision tree is very simple; write it as follows:

# -*- coding: UTF-8 -*-

"""
函数说明:使用决策树分类
Parameters:
    inputTree - 已经生成的决策树
    featLabels - 存储选择的最优特征标签
    testVec - 测试数据列表,顺序对应最优特征标签
Returns:
    classLabel - 分类结果
"""
def classify(inputTree, featLabels, testVec):
    ### Start Code Here ###
    #实现分类函数
    firstStr = list(inputTree.keys())[0]       #需要获取首个特征的列号,以便从测试数据中取值比较
    secondDict = inputTree[firstStr]           #获得第二个字典
    featIndex = featLabels.index(firstStr)      #获取测试集对应特征数值
    for key in secondDict.keys():
        if(testVec[featIndex] == key):
            if(type(secondDict[key]).__name__ == 'dict'):       #判断该值是否还是字典,如果是,则继续递归
                classlabel = classify(secondDict[key],featLabels,testVec)
            else:
                classlabel = secondDict[key]
    return classlabel
    
    ### End Code Here ###
    
if __name__ == '__main__':
    dataSet, labels = createDataSet()
    featLabels = []
    myTree = createTree(dataSet, labels, featLabels)
    testVec = [0, 1]                                       # test data: no house, has a job
    result = classify(myTree, featLabels, testVec)
    if result == 'yes':
        print('放贷')                                       # grant the loan
    if result == 'no':
        print('不放贷')                                      # deny the loan
第0个特征的增益为0.083
第1个特征的增益为0.324
第2个特征的增益为0.420
第3个特征的增益为0.363
第0个特征的增益为0.252
第1个特征的增益为0.918
第2个特征的增益为0.474
放贷

Here only the classify function has been added for decision tree classification. The test data [0,1] means the person does not own a house but has a job; the printed result 放贷 means the loan is granted.
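As a further check (using the tree printed above, in which featLabels is ['有自己的房子', '有工作'], i.e. house first and then job), a person with neither a house nor a job would be classified as 'no':

testVec = [0, 0]                                   # no house, no job
print(classify(myTree, featLabels, testVec))       # -> 'no', i.e. 不放贷 (deny the loan)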

[Exercise] Predict the type of contact lenses based on the decision tree --- use Sklearn

  Once we understand how a decision tree works, we can help people determine the type of contact lenses they need to wear. The contact lens data set is a very well-known data set: it contains many observation conditions of different eye states together with the type of contact lenses recommended by doctors. The contact lens types are hard, soft, and no lenses (not suitable for wearing contact lenses).
  The data set contains 24 rows. Its columns are, in order, age, prescript, astigmatic, tearRate and class: the first column is the age, the second the prescription, the third whether the eye is astigmatic, the fourth the tear production rate, and the fifth the final class label. The data are shown below:
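The original figure of the data is not reproduced here. The file is tab-separated; its first few rows (taken from the list printed by the code below) look like this:

young	myope	no	reduced	no lenses
young	myope	no	normal	soft
young	myope	yes	reduced	no lenses
young	myope	yes	normal	hard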


  Next, let's talk about how to use Sklearn to build a decision tree. The sklearn.tree module provides decision tree models for solving classification and regression problems. The method is shown in the figure below:

  We use DecisionTreeClassifier to build the decision tree. This function has 12 parameters; they are described as follows (a short, illustrative example of setting several of them is given after the list):
  criterion: the feature selection criterion, optional, default gini; can be set to entropy. gini is the Gini impurity, i.e. the expected error rate of randomly applying one of the results from a set to a data item; entropy is the Shannon entropy.
  splitter: the split-point selection strategy at each node, optional, default best; can be set to random. best selects the best split feature according to the criterion (gini or entropy), while random picks a locally optimal split among a random subset of candidate split points. The default best is suitable when the sample size is not large; if the sample size is very large, random is recommended for building the tree.
  max_features: the maximum number of features considered when splitting, optional, default None. When looking for the best split (n_features being the total number of features) there are the following 6 cases:
    if max_features is an integer, consider max_features features;
    if max_features is a float, consider int(max_features * n_features) features;
    if max_features is set to auto, then max_features = sqrt(n_features);
    if max_features is set to sqrt, then max_features = sqrt(n_features), the same as auto;
    if max_features is set to log2, then max_features = log2(n_features);
    if max_features is set to None, then max_features = n_features, i.e. all features are used.
  Generally, if the number of features is small (say fewer than 50), the default None is fine; if the number of features is very large, the other values can be used to control the maximum number of features considered per split and thus the time needed to build the tree.
  max_depth: the maximum depth of the decision tree, optional, default None. This parameter is the number of layers of the tree; for example, in the loan example the decision tree has 2 layers. If set to None, the depth of subtrees is not limited while building the tree. Generally this value can be ignored when there are few samples or few features; alternatively, if min_samples_split is set, a branch stops growing once a node has fewer than min_samples_split samples. If the model has many samples and many features, it is recommended to limit the maximum depth; the specific value depends on the distribution of the data, with common values between 10 and 100.
  min_samples_split: the minimum number of samples required to split an internal node, optional, default 2. This value limits the conditions under which a subtree may be split further. If min_samples_split is an integer, it is the minimum number of samples needed to split an internal node, i.e. a node with fewer samples is not split. If it is a float, it is a fraction and ceil(min_samples_split * n_samples) samples (rounded up) are required. If the sample size is not large, this value can be ignored; if the sample size is of a very large order of magnitude, it is recommended to increase it.
  min_samples_leaf: the minimum number of samples in a leaf node, optional, default 1. This value limits the minimum number of samples a leaf node may hold; if a leaf ends up with fewer samples, it is pruned together with its sibling nodes. If set to 1, a leaf is kept even if it contains only 1 sample of that class. If min_samples_leaf is an integer, it is the minimum number of samples; if it is a float, it is a fraction and ceil(min_samples_leaf * n_samples) samples (rounded up) are required. If the sample size is not large, this value can be ignored; if the sample size is very large, it is recommended to increase it.
  class_weight: class weights, optional, default None; can also be a dictionary, a list of dictionaries, or balanced. Specifying the weight of each class mainly prevents the training set from containing too many samples of certain classes, which would bias the trained decision tree towards those classes. Weights can be given in the format {class_label: weight}; you can specify the weight of each class yourself or use balanced, in which case the algorithm computes the weights itself and classes with few samples receive higher weights. If the class distribution of your samples has no obvious bias, you can ignore this parameter and keep the default None.
  random_state: optional, default None; the random number seed. If it is an integer, random_state is used as the seed of the random number generator. If no seed is set, the random numbers depend on the current system time and differ at every moment; if a seed is set, the same seed produces the same random numbers at different times. If it is a RandomState instance, random_state is the random number generator itself; if None, the random number generator is np.random.
  min_impurity_split: the minimum impurity for splitting a node, optional, default 1e-7. This threshold limits the growth of the decision tree: if the impurity of a node (Gini impurity, information gain, mean squared error or mean absolute error) is below this threshold, the node no longer generates child nodes and becomes a leaf node.
  min_weight_fraction_leaf: the minimum weighted fraction of samples in a leaf node, optional, default 0. This value limits the minimum of the sum of all sample weights in a leaf node; if the sum falls below this value, the leaf is pruned together with its sibling nodes. Generally, if many samples have missing values, or the class distribution of the classification tree's samples is strongly skewed, sample weights are introduced and this value should then be taken into account.
  max_leaf_nodes: the maximum number of leaf nodes, optional, default None. Limiting the maximum number of leaf nodes can prevent overfitting; if a limit is set, the algorithm builds the optimal decision tree within that number of leaf nodes. If there are not many features, this value can be ignored; if there are many features, it can be limited, and the specific value can be found by cross-validation.
  presort: whether to presort the data, optional, default False (a Boolean value). Generally, if the sample size is small or the tree depth is limited to a small value, setting it to True makes the selection of split points and the construction of the decision tree faster.
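A minimal sketch of how a few of these parameters might be set together (the values below are purely illustrative and are not taken from the original text):

from sklearn import tree

clf = tree.DecisionTreeClassifier(criterion='entropy',    # use Shannon entropy instead of the default gini
                                  max_depth=4,            # limit the number of layers of the tree
                                  min_samples_split=2,    # minimum samples needed to split an internal node
                                  min_samples_leaf=1,     # minimum samples required in a leaf node
                                  random_state=0)         # fixed seed so the splits are reproducible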
Besides these parameters, other points to pay attention to when tuning are:
  1) When the number of samples is small but the number of features is very large, the decision tree easily overfits. Generally, a robust model is easier to build when there are more samples than features.
  2) If the number of samples is small but the number of features is very large, it is recommended to reduce the dimensionality before fitting the decision tree, for example with principal component analysis (PCA), feature selection (Lasso) or independent component analysis (ICA); the feature dimensionality will then be much smaller, and fitting the decision tree afterwards will work better.
  3) It is recommended to make frequent use of decision tree visualization and to limit the depth of the tree first, so that you can observe the preliminary fit of the data in the generated tree before deciding whether to increase the depth.
  4) When training the model, pay attention to the class distribution of the samples (mainly for classification trees). If the class distribution is very uneven, consider using class_weight to keep the model from being biased towards the classes with more samples.
  5) The decision tree internally uses numpy arrays of type float32. If the training data is not in this format, the algorithm first copies the data and then runs.
  6) If the input sample matrix is sparse, it is recommended to convert it to csc_matrix before fitting and to csr_matrix before prediction.
sklearn.tree.DecisionTreeClassifier() provides some methods for us to use, as shown in the following figure:

 

[Exercise] Predict the type of contact lenses based on the decision tree --- write code

 

# -*- coding: UTF-8 -*-
from sklearn import tree
if __name__ == '__main__':
    fr = open('decision_tree_glass/lenses.txt')
    lenses = [inst.strip().split('\t') for inst in fr.readlines()]
    print(lenses)
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
    clf = tree.DecisionTreeClassifier()
    lenses = clf.fit(lenses, lensesLabels)
[['young', 'myope', 'no', 'reduced', 'no lenses'], ['young', 'myope', 'no', 'normal', 'soft'], ['young', 'myope', 'yes', 'reduced', 'no lenses'], ['young', 'myope', 'yes', 'normal', 'hard'], ['young', 'hyper', 'no', 'reduced', 'no lenses'], ['young', 'hyper', 'no', 'normal', 'soft'], ['young', 'hyper', 'yes', 'reduced', 'no lenses'], ['young', 'hyper', 'yes', 'normal', 'hard'], ['pre', 'myope', 'no', 'reduced', 'no lenses'], ['pre', 'myope', 'no', 'normal', 'soft'], ['pre', 'myope', 'yes', 'reduced', 'no lenses'], ['pre', 'myope', 'yes', 'normal', 'hard'], ['pre', 'hyper', 'no', 'reduced', 'no lenses'], ['pre', 'hyper', 'no', 'normal', 'soft'], ['pre', 'hyper', 'yes', 'reduced', 'no lenses'], ['pre', 'hyper', 'yes', 'normal', 'no lenses'], ['presbyopic', 'myope', 'no', 'reduced', 'no lenses'], ['presbyopic', 'myope', 'no', 'normal', 'no lenses'], ['presbyopic', 'myope', 'yes', 'reduced', 'no lenses'], ['presbyopic', 'myope', 'yes', 'normal', 'hard'], ['presbyopic', 'hyper', 'no', 'reduced', 'no lenses'], ['presbyopic', 'hyper', 'no', 'normal', 'soft'], ['presbyopic', 'hyper', 'yes', 'reduced', 'no lenses'], ['presbyopic', 'hyper', 'yes', 'normal', 'no lenses']]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-79a6415291d2> in <module>
      7     lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
      8     clf = tree.DecisionTreeClassifier()
----> 9     lenses = clf.fit(lenses, lensesLabels)

/opt/conda/lib/python3.6/site-packages/sklearn/tree/tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    799             sample_weight=sample_weight,
    800             check_input=check_input,
--> 801             X_idx_sorted=X_idx_sorted)
    802         return self
    803 

/opt/conda/lib/python3.6/site-packages/sklearn/tree/tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    114         random_state = check_random_state(self.random_state)
    115         if check_input:
--> 116             X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    117             y = check_array(y, ensure_2d=False, dtype=None)
    118             if issparse(X):

/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    525             try:
    526                 warnings.simplefilter('error', ComplexWarning)
--> 527                 array = np.asarray(array, dtype=dtype, order=order)
    528             except ComplexWarning:
    529                 raise ValueError("Complex data not supported\n"

/opt/conda/lib/python3.6/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

ValueError: could not convert string to float: 'young'

 We can see that the program reported an error. Why? Because the fit() function cannot accept data of string type, and from the printed information you can see that the data are all strings. Before calling fit(), we therefore need to encode the data set. Two methods can be used here:
  • LabelEncoder: converts strings to incremental integer values
  • OneHotEncoder: converts strings to integers using the one-of-K (one-hot) scheme
For the string data we first generate a pandas DataFrame, which makes the encoding convenient. The approach used here is: raw data -> dictionary -> pandas DataFrame.
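As a quick, self-contained illustration of what LabelEncoder does with one of the columns (the integer codes follow the alphabetical order of the values, which is why 'pre' becomes 0 and 'young' becomes 2 in the encoded table further below):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['young', 'pre', 'presbyopic', 'young'])
print(list(le.classes_))   # ['pre', 'presbyopic', 'young']
print(list(codes))         # [2, 0, 1, 2]

The exercise code below applies the same idea to every column of the DataFrame: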

# -*- coding: UTF-8 -*-
import pandas as pd
if __name__ == '__main__':
    with open('decision_tree_glass/lenses.txt', 'r') as fr:                  # load the file
        lenses = [inst.strip().split('\t') for inst in fr.readlines()]       # parse the file
    lenses_target = []                                                       # extract the class of each row and keep it in a list
    for each in lenses:
        lenses_target.append(each[-1])
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']            # feature labels
    lenses_list = []                                                         # temporary list holding the lenses data
    lenses_dict = {}                                                         # dictionary holding the lenses data, used to build the pandas DataFrame
    for each_label in lensesLabels:                                          # extract the information column by column and build the dictionary
        for each in lenses:
            lenses_list.append(each[lensesLabels.index(each_label)])
        lenses_dict[each_label] = lenses_list
        lenses_list = []
    print(lenses_dict)                                                       # print the dictionary
    lenses_pd = pd.DataFrame(lenses_dict)                                    # build the pandas.DataFrame
    print(lenses_pd)
{'age': ['young', 'young', 'young', 'young', 'young', 'young', 'young', 'young', 'pre', 'pre', 'pre', 'pre', 'pre', 'pre', 'pre', 'pre', 'presbyopic', 'presbyopic', 'presbyopic', 'presbyopic', 'presbyopic', 'presbyopic', 'presbyopic', 'presbyopic'], 'prescript': ['myope', 'myope', 'myope', 'myope', 'hyper', 'hyper', 'hyper', 'hyper', 'myope', 'myope', 'myope', 'myope', 'hyper', 'hyper', 'hyper', 'hyper', 'myope', 'myope', 'myope', 'myope', 'hyper', 'hyper', 'hyper', 'hyper'], 'astigmatic': ['no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes'], 'tearRate': ['reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal']}
           age prescript astigmatic tearRate
0        young     myope         no  reduced
1        young     myope         no   normal
2        young     myope        yes  reduced
3        young     myope        yes   normal
4        young     hyper         no  reduced
5        young     hyper         no   normal
6        young     hyper        yes  reduced
7        young     hyper        yes   normal
8          pre     myope         no  reduced
9          pre     myope         no   normal
10         pre     myope        yes  reduced
11         pre     myope        yes   normal
12         pre     hyper         no  reduced
13         pre     hyper         no   normal
14         pre     hyper        yes  reduced
15         pre     hyper        yes   normal
16  presbyopic     myope         no  reduced
17  presbyopic     myope         no   normal
18  presbyopic     myope        yes  reduced
19  presbyopic     myope        yes   normal
20  presbyopic     hyper         no  reduced
21  presbyopic     hyper         no   normal
22  presbyopic     hyper        yes  reduced
23  presbyopic     hyper        yes   normal
As the output shows, the pandas DataFrame is generated successfully.
Next, encode the data; write the code as follows:

#!pip install pydotplus
# -*- coding: UTF-8 -*-
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import pydotplus
from sklearn.externals.six import StringIO
if __name__ == '__main__':
    with open('decision_tree_glass/lenses.txt', 'r') as fr:                  # load the file
        lenses = [inst.strip().split('\t') for inst in fr.readlines()]       # parse the file
    lenses_target = []                                                       # extract the class of each row and keep it in a list
    for each in lenses:
        lenses_target.append(each[-1])
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']            # feature labels
    lenses_list = []                                                         # temporary list holding the lenses data
    lenses_dict = {}                                                         # dictionary holding the lenses data, used to build the pandas DataFrame
    for each_label in lensesLabels:                                          # extract the information column by column and build the dictionary
        for each in lenses:
            lenses_list.append(each[lensesLabels.index(each_label)])
        lenses_dict[each_label] = lenses_list
        lenses_list = []
    # print(lenses_dict)                                                     # print the dictionary
    lenses_pd = pd.DataFrame(lenses_dict)                                    # build the pandas.DataFrame
    print(lenses_pd)                                                         # print the pandas.DataFrame
    le = LabelEncoder()                                                      # create the LabelEncoder() object used for encoding
    for col in lenses_pd.columns:                                            # encode every column
        lenses_pd[col] = le.fit_transform(lenses_pd[col])
    print(lenses_pd)
           age prescript astigmatic tearRate
0        young     myope         no  reduced
1        young     myope         no   normal
2        young     myope        yes  reduced
3        young     myope        yes   normal
4        young     hyper         no  reduced
5        young     hyper         no   normal
6        young     hyper        yes  reduced
7        young     hyper        yes   normal
8          pre     myope         no  reduced
9          pre     myope         no   normal
10         pre     myope        yes  reduced
11         pre     myope        yes   normal
12         pre     hyper         no  reduced
13         pre     hyper         no   normal
14         pre     hyper        yes  reduced
15         pre     hyper        yes   normal
16  presbyopic     myope         no  reduced
17  presbyopic     myope         no   normal
18  presbyopic     myope        yes  reduced
19  presbyopic     myope        yes   normal
20  presbyopic     hyper         no  reduced
21  presbyopic     hyper         no   normal
22  presbyopic     hyper        yes  reduced
23  presbyopic     hyper        yes   normal
    age  prescript  astigmatic  tearRate
0     2          1           0         1
1     2          1           0         0
2     2          1           1         1
3     2          1           1         0
4     2          0           0         1
5     2          0           0         0
6     2          0           1         1
7     2          0           1         0
8     0          1           0         1
9     0          1           0         0
10    0          1           1         1
11    0          1           1         0
12    0          0           0         1
13    0          0           0         0
14    0          0           1         1
15    0          0           1         0
16    1          1           0         1
17    1          1           0         0
18    1          1           1         1
19    1          1           1         0
20    1          0           0         1
21    1          0           0         0
22    1          0           1         1
23    1          0           1         0

As you can see from the printed results, we have successfully encoded the data. Next, we can fit() the data and build the decision tree.

[Exercise] Predict the type of contact lenses based on the decision tree --- prediction

After the decision tree has been built, we can make predictions: according to your eye condition and age, you can check what kind of contact lenses suit you. Use the following code to see the prediction result:
print(clf.predict([[1,1,1,0]]))
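It helps to spell out what this encoded test vector means. Reading the encoded DataFrame printed earlier (this mapping is inferred from that table, not stated explicitly in the original text):

# column codes inferred from the encoded DataFrame above:
#   age:        pre=0, presbyopic=1, young=2
#   prescript:  hyper=0, myope=1
#   astigmatic: no=0, yes=1
#   tearRate:   normal=0, reduced=1
# so [1, 1, 1, 0] describes a presbyopic, myopic, astigmatic eye with a normal tear rate
print(clf.predict([[1, 1, 1, 0]]))   # expected output: ['hard']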
The complete code is as follows:

# -*- coding: UTF-8 -*-
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.externals.six import StringIO
from sklearn import tree
import pandas as pd
import numpy as np
import pydotplus

if __name__ == '__main__':
    with open('decision_tree_glass/lenses.txt', 'r') as fr:                  # load the file
        lenses = [inst.strip().split('\t') for inst in fr.readlines()]       # parse the file
    lenses_target = []                                                       # extract the class of each row and keep it in a list
    for each in lenses:
        lenses_target.append(each[-1])
    print(lenses_target)

    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']            # feature labels
    lenses_list = []                                                         # temporary list holding the lenses data
    lenses_dict = {}                                                         # dictionary holding the lenses data, used to build the pandas DataFrame
    for each_label in lensesLabels:                                          # extract the information column by column and build the dictionary
        for each in lenses:
            lenses_list.append(each[lensesLabels.index(each_label)])
        lenses_dict[each_label] = lenses_list
        lenses_list = []
    # print(lenses_dict)                                                     # print the dictionary
    lenses_pd = pd.DataFrame(lenses_dict)                                    # build the pandas.DataFrame
    # print(lenses_pd)                                                       # print the pandas.DataFrame
    le = LabelEncoder()                                                      # create the LabelEncoder() object used for encoding
    for col in lenses_pd.columns:                                            # encode every column
        lenses_pd[col] = le.fit_transform(lenses_pd[col])
    # print(lenses_pd)                                                       # print the encoded data

    ### Start Code Here ###

    # create the DecisionTreeClassifier() object
    clf = tree.DecisionTreeClassifier()
    # fit the data and build the decision tree
    clf.fit(lenses_pd, lenses_target)
    ### End Code Here ###
    dot_data = StringIO()
    tree.export_graphviz(clf, out_file = dot_data,                           # draw the decision tree
                        feature_names = lenses_pd.keys(),
                        class_names = clf.classes_,
                        filled=True, rounded=True,
                        special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    graph.write_pdf('tree.pdf')                                              # save the drawn decision tree as a PDF file
    print(clf.predict([[1,1,1,0]]))                                          # print the prediction result
['no lenses', 'soft', 'no lenses', 'hard', 'no lenses', 'soft', 'no lenses', 'hard', 'no lenses', 'soft', 'no lenses', 'hard', 'no lenses', 'soft', 'no lenses', 'no lenses', 'no lenses', 'no lenses', 'no lenses', 'hard', 'no lenses', 'soft', 'no lenses', 'no lenses']
['hard']

Experiment summary

Through this experiment, you have mastered the construction of a decision tree and classification with it, and implemented contact lens prediction based on a decision tree.


Origin blog.csdn.net/weixin_46601559/article/details/125101482