2023 Higher Education Society Cup Mathematical Modeling Ideas - Case: FP-Tree Frequent Pattern Mining Algorithm

0 Question Ideas

(Ideas will be shared on CSDN as soon as the competition problems are released)

https://blog.csdn.net/dc_sinor?type=blog

1 Algorithm introduction

The full name of the FP-Tree algorithm is the Frequent Pattern Tree algorithm. Like the Apriori algorithm, it is used to mine frequent itemsets, but it is an optimization of Apriori: whereas Apriori generates a large number of candidate sets during mining, FP-Tree discovers frequent patterns without generating candidate sets at all. Once the frequent patterns have been mined, however, the step of generating association rules is the same as in Apriori.

There are two common families of algorithms for mining frequent itemsets: Apriori and FP-growth. Apriori mines frequent itemsets by repeatedly constructing and filtering candidate sets, which requires scanning the original data many times; when the data set is large, the number of disk I/O operations becomes excessive and efficiency suffers. FP-growth abandons Apriori's generate-and-test strategy: it scans the original data only twice and compresses it into the FP-tree data structure, which is far more efficient.
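To make the contrast concrete, the first of FP-growth's two scans simply counts the global support of every item. A minimal sketch, using a toy transaction database with illustrative item names:

```python
from collections import Counter

# Toy transaction database; item names are illustrative only.
transactions = [
    {'bread', 'milk'},
    {'bread', 'diapers', 'beer'},
    {'milk', 'diapers', 'beer'},
    {'bread', 'milk', 'diapers'},
]

# First of FP-growth's two scans: count the global support of every item.
support = Counter(item for t in transactions for item in t)

min_support = 3
frequent = {item for item, cnt in support.items() if cnt >= min_support}
print(sorted(frequent))  # ['bread', 'diapers', 'milk'] -- 'beer' (support 2) is dropped
```

Apriori would go on to generate and test candidate pairs, triples, and so on against the data; FP-growth instead uses these counts to build the FP-tree in its second and final scan.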

FP stands for Frequent Pattern. The algorithm consists of two main steps: constructing the FP-tree, and mining frequent itemsets from it.

2 FP tree representation

The FP tree is constructed by reading in transactions one by one and mapping each transaction to a path in the FP tree. Because different transactions may share several items, their paths may partially overlap. The more the paths overlap, the better the compression achieved by the FP-tree structure; if the FP tree is small enough to fit in memory, frequent itemsets can be extracted directly from this in-memory structure without repeatedly scanning the data on disk.
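The path-sharing idea can be sketched with a minimal trie built from nested dictionaries (the item order is assumed to already be fixed, e.g. by descending global support):

```python
# Minimal trie sketch: two transactions sharing the prefix (z, x) map to
# one shared path, illustrating FP-tree compression.
def add_path(tree, items):
    node = tree
    for item in items:
        child = node.setdefault(item, {'count': 0, 'children': {}})
        child['count'] += 1  # shared prefixes accumulate counts instead of duplicating nodes
        node = child['children']

root = {}
add_path(root, ['z', 'x', 'y'])   # transaction 1
add_path(root, ['z', 'x', 's'])   # transaction 2 shares the prefix z -> x

# 'z' appears once in the tree with count 2, not as two separate nodes:
print(root['z']['count'])                              # 2
print(sorted(root['z']['children']['x']['children']))  # ['s', 'y']
```

The real FP-tree (built in section 3 below) additionally stores parent pointers and cross-path item links, but the compression principle is exactly this prefix sharing.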

An FP tree is shown in the figure below:
[Figure: an example FP tree]
In general the FP tree is smaller than the uncompressed data, because transactions often share common items. In the best case, all transactions contain the same itemset and the FP tree collapses to a single path of nodes. The worst case occurs when every transaction has a unique itemset: since the transactions then share no common items, the FP tree is effectively the same size as the original data.

The root node of the FP tree is labeled φ (null). Every other node stores a data item together with that item's support count along its path; each path represents a set of data items from the training data that satisfy the minimum support. The FP tree also chains all nodes holding the same item into a linked list, shown as the blue connections in the figure above.

To allow quick access to identical items in the tree, a header table (headTable) is also maintained that links nodes containing the same item. Each entry of the table holds: the data item, the item's global support count, and a pointer to the head of that item's linked list in the FP tree.
[Figure: FP tree with its header table]
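A hypothetical sketch of this layout: each header entry is a pair `[global support count, head of the item's linked list]`, and nodes with the same item are chained through a `nodeLink` field (the class and function names here are illustrative, not the implementation below):

```python
# Sketch of the header-table structure: item -> [support count, list head].
class Node:
    def __init__(self, name, count):
        self.name = name
        self.count = count
        self.nodeLink = None  # next node with the same item, possibly on another path

header_table = {'x': [4, None]}  # 'x' has global support 4; no nodes linked yet

def link_node(header_table, node):
    entry = header_table[node.name]
    if entry[1] is None:
        entry[1] = node            # first occurrence becomes the list head
    else:
        cur = entry[1]
        while cur.nodeLink is not None:
            cur = cur.nodeLink
        cur.nodeLink = node        # append at the tail of the linked list

a, b = Node('x', 2), Node('x', 1)
link_node(header_table, a)
link_node(header_table, b)
print(header_table['x'][1] is a, a.nodeLink is b)  # True True
```

Walking an item's linked list from the header visits every occurrence of that item in the tree, regardless of which path it lies on, which is what makes later mining efficient.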

3 Build FP tree

Suppose we have the following transaction data:

[Figure: sample transaction data]

The FP-growth algorithm needs to scan the original training set twice to build the FP tree.

In the first scan, all items that do not meet the minimum support are filtered out, and the remaining items are sorted in descending order of their global support count. On top of that, for convenience of processing, items with equal support can additionally be ordered by the item key.
[Figure: item support counts after the first scan]
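This post-processing of the first scan can be sketched as follows, with illustrative support counts and the same tie-breaking convention as the implementation below (descending support, then descending item key):

```python
# Illustrative support counts from a first scan.
support = {'z': 5, 'x': 4, 'y': 3, 'r': 3, 't': 3, 's': 3, 'q': 2}

min_support = 3
# Drop infrequent items ('q' falls below the threshold).
support = {k: v for k, v in support.items() if v >= min_support}

# Fix a global order: support descending, item key descending as tie-breaker.
order = sorted(support, key=lambda k: (support[k], k), reverse=True)
print(order)  # ['z', 'x', 'y', 't', 's', 'r']
```

Fixing one global order here is essential: every transaction inserted in the second scan is rearranged into this order, which is what makes shared prefixes (and therefore path overlap) possible.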

In the second scan, the FP tree is constructed.

The second scan processes the filtered data. When a data item is encountered for the first time, a new node is created and a pointer to it is added to the headTable; otherwise, the node corresponding to the item is located along the existing path and its count is updated. The specific process is as follows:

[Figures: FP tree after inserting each transaction in turn]
As the figures show, the headTable is not created together with the FP tree: it is already populated during the first scan, so while building the FP tree only the pointers need to be set to the corresponding nodes. Starting from transaction 004, links between nodes must also be created so that identical items on different paths are chained into a linked list.

4 Implementation code

def loadSimpDat():
    simpDat = [['r', 'z', 'h', 'j', 'p'],
               ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
               ['z'],
               ['r', 'x', 'n', 'o', 's'],
               ['y', 'r', 'x', 'z', 'q', 't', 'p'],
               ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
    return simpDat

def createInitSet(dataSet):
    retDict = {}
    for trans in dataSet:
        fset = frozenset(trans)
        retDict.setdefault(fset, 0)
        retDict[fset] += 1
    return retDict

class treeNode:
    def __init__(self, nameValue, numOccur, parentNode):
        self.name = nameValue
        self.count = numOccur
        self.nodeLink = None
        self.parent = parentNode
        self.children = {}

    def inc(self, numOccur):
        self.count += numOccur

    def disp(self, ind=1):
        print('   ' * ind, self.name, ' ', self.count)
        for child in self.children.values():
            child.disp(ind + 1)


def createTree(dataSet, minSup=1):
    headerTable = {}
    # First pass over the data set: record the support count of every item
    for trans in dataSet:
        for item in trans:
            headerTable[item] = headerTable.get(item, 0) + 1

    # Filter out items below the minimum support
    lessThanMinsup = list(filter(lambda k: headerTable[k] < minSup, headerTable.keys()))
    for k in lessThanMinsup: del(headerTable[k])

    freqItemSet = set(headerTable.keys())
    # If no item satisfies the minimum support, return None, None
    if len(freqItemSet) == 0:
        return None, None

    # Expand each entry to [support count, head of the item's node linked list]
    for k in headerTable:
        headerTable[k] = [headerTable[k], None]

    retTree = treeNode('φ', 1, None)
    # Second pass over the data set: build the FP-tree
    for tranSet, count in dataSet.items():
        # Keep only this transaction's frequent items; key: item, value: its global support count
        localD = {}
        for item in tranSet:
            if item in freqItemSet:
                localD[item] = headerTable[item][0]

        if len(localD) > 0:
            # Sort the transaction's items by global support, equivalent to: order by p[1] desc, p[0] desc
            orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: (p[1], p[0]), reverse=True)]
            updateTree(orderedItems, retTree, headerTable, count)
    return retTree, headerTable


def updateTree(items, inTree, headerTable, count):
    if items[0] in inTree.children:  # check if orderedItems[0] is in retTree.children
        inTree.children[items[0]].inc(count)  # increment count
    else:  # add items[0] to inTree.children
        inTree.children[items[0]] = treeNode(items[0], count, inTree)
        if headerTable[items[0]][1] is None:  # update header table
            headerTable[items[0]][1] = inTree.children[items[0]]
        else:
            updateHeader(headerTable[items[0]][1], inTree.children[items[0]])

    if len(items) > 1:  # call updateTree() with the remaining ordered items
        updateTree(items[1:], inTree.children[items[0]], headerTable, count)


def updateHeader(nodeToTest, targetNode):  # this version does not use recursion
    while nodeToTest.nodeLink is not None:  # do not use recursion to traverse a linked list!
        nodeToTest = nodeToTest.nodeLink
    nodeToTest.nodeLink = targetNode

simpDat = loadSimpDat()
dictDat = createInitSet(simpDat)
myFPTree,myheader = createTree(dictDat, 3)
myFPTree.disp()

Note that the code above does not sort each transaction's filtered items immediately after the first scan; instead, the sorting is deferred to the second scan, which keeps the code simpler.
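The deferred, in-scan ordering step can be illustrated in isolation. This sketch assumes illustrative header-table counts and reuses the same sort key as the `sorted(...)` call in `createTree` above:

```python
# Illustrative header table after the first scan: item -> [support, list head].
headerTable = {'z': [5, None], 'x': [4, None], 'y': [3, None],
               's': [3, None], 'r': [3, None], 't': [3, None]}

# One raw transaction; 'w', 'v', 'u' were filtered out as infrequent.
trans = frozenset(['z', 'y', 'x', 'w', 'v', 'u', 't', 's'])

# Keep frequent items with their global supports, then order them
# by (support desc, item desc) just before insertion into the tree.
localD = {item: headerTable[item][0] for item in trans if item in headerTable}
ordered = [v[0] for v in sorted(localD.items(), key=lambda p: (p[1], p[0]), reverse=True)]
print(ordered)  # ['z', 'x', 'y', 't', 's']
```

Because every transaction is rearranged with the same key, transactions that share frequent items also share path prefixes in the tree.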

Console output:

[Figure: console output showing the constructed FP tree]

Origin blog.csdn.net/dc_sinor/article/details/132583016