"Introduction to Data Mining" lab course - Experiment 6: Association Analysis in Data Mining

Experiment 6: Association Analysis in Data Mining

I. Purpose of the Experiment

1. Understand the basic principles of the Apriori algorithm

2. Understand the basic principles of the FP-growth algorithm

3. Learn to implement the Apriori algorithm in Python

4. Learn to implement the FP-growth algorithm in Python

II. Experimental Tools

1. Anaconda

2. sklearn

3. Pandas

III. Introduction

The Apriori algorithm has been highly influential in the field of association-rule discovery. It is so named because it exploits prior (a priori) knowledge of the properties of frequent itemsets. Concretely, Apriori divides association-rule mining into two steps: first, iteratively retrieve all frequent itemsets from the transaction database, i.e. the itemsets whose support is no less than the threshold set by the user; second, use those frequent itemsets to construct rules that satisfy the user's minimum confidence. Mining all frequent itemsets is the core of the algorithm and accounts for most of the overall computation.
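The two thresholds mentioned above can be stated concretely. A minimal sketch, where the small three-transaction database is only an illustration (it is not the experiment's data set):

```python
# support(X): fraction of transactions that contain itemset X
# confidence(X -> Y): support(X | Y) / support(X)
transactions = [{'a', 'b', 'd', 'e'}, {'b', 'c', 'd'}, {'a', 'b', 'd', 'e'}]

def support(itemset, transactions):
    # count the transactions of which itemset is a subset
    return sum(itemset <= t for t in transactions) / len(transactions)

s = support({'b', 'd'}, transactions)                 # all 3 transactions contain b and d -> 1.0
conf = support({'b', 'd', 'e'}, transactions) / s     # 2/3: of the transactions with {b, d}, two also contain e
```

A rule {b, d} => {e} would be emitted only if its confidence (here 2/3) reaches the user's minimum confidence.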

In the study of depth-first data-mining algorithms, Han et al. did not use candidate generation to find frequent itemsets; instead they proposed an algorithm called frequent-pattern growth (FP-growth). The algorithm scans the database and creates an FP-tree whose root is labeled null. For each transaction Trans in database D, the frequent items in Trans are sorted according to the order of the frequent-item list L; write the sorted list as [p | P], where p is the first element and P is the remaining list. The function insert_tree([p | P], T) is then called: if the tree T has a child node N with N.item_name = p.item_name, the count of N is incremented by 1; otherwise a new node N is created with its count set to 1, linked to its parent T, and connected via node-links to the nodes of the same name. If P is not empty, insert_tree(P, N) is called recursively. Because the database contents are compressed, the FP-tree retains the association information among the itemsets as the frequent items are written into it. The problem of finding frequent itemsets is thus converted into recursively finding the shortest frequent patterns and concatenating them with their long suffixes to form the frequent patterns.
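The insert_tree step described above can be sketched as follows. This is a simplified illustration under one stated omission: the header-table node-links mentioned in the text are left out.

```python
class FPNode:
    def __init__(self, item, parent):
        self.item = item      # item_name
        self.count = 1        # a new node is created with count 1
        self.parent = parent  # link to parent node T
        self.children = {}    # item -> child FPNode

def insert_tree(items, T):
    """Insert an ordered frequent-item list [p | P] into the tree rooted at T."""
    if not items:
        return
    p, P = items[0], items[1:]
    if p in T.children:             # T has a child N with N.item_name == p
        T.children[p].count += 1    # increment N's count by 1
    else:                           # otherwise create N with count 1, linked to T
        T.children[p] = FPNode(p, T)
    if P:                           # if P is not empty, recurse: insert_tree(P, N)
        insert_tree(P, T.children[p])

root = FPNode(None, None)           # root labeled null
root.count = 0
for trans in (['d', 'b', 'a'], ['d', 'b', 'a'], ['d', 'b']):
    insert_tree(trans, root)        # shared prefixes d-b are compressed into one path
```

After the three insertions the path d-b has count 3 and its child a has count 2, showing how shared prefixes compress the database.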

IV. Experiment Content

1. Implement the Apriori algorithm in Python and mine the frequent itemsets of the data set in the following table, with a minimum support of 30%.

TID   Items
T1    a, b, d, e
T2    b, c, d
T3    a, b, d, e
T4    a, b, d, e
T5    a, b, d, e
T6    b, d, e
T7    c, d
T8    a, b, c
T9    a, d, e
T10   b, d
The Python implementation of the Apriori algorithm:

def load_data_set():
    data_set = [['a', 'b', 'd', 'e'], ['b', 'c', 'd'], ['a', 'b', 'd', 'e'],
                ['a', 'b', 'd', 'e'], ['a', 'b', 'd', 'e'], ['b', 'd', 'e'],
                ['c', 'd'], ['a', 'b', 'c'], ['a', 'd', 'e'], ['b', 'd']]
    return data_set

def Create_C1(data_set):
    '''Parameter: the transaction database'''
    C1 = set()
    for t in data_set:
        for item in t:
            # frozenset provides issubset() when scanning the database
            # to generate the frequent itemsets
            item_set = frozenset([item])
            C1.add(item_set)
    return C1

def is_apriori(Ck_item, Lk_sub_1):
    '''Parameters: a candidate frequent k-itemset, the frequent (k-1)-itemsets'''
    for item in Ck_item:
        sub_item = Ck_item - frozenset([item])
        if sub_item not in Lk_sub_1:
            return False
    return True

def Create_Ck(Lk_sub_1, k):
    '''Parameters: the frequent (k-1)-itemsets, the size k of the candidates to generate'''
    Ck = set()
    len_Lk_sub_1 = len(Lk_sub_1)
    list_Lk_sub_1 = list(Lk_sub_1)
    for i in range(len_Lk_sub_1):
        for j in range(i + 1, len_Lk_sub_1):
            l1 = list(list_Lk_sub_1[i])
            l2 = list(list_Lk_sub_1[j])
            l1.sort()
            l2.sort()
            # join step: merge only if the first k-2 elements of l1 and l2 agree
            # (list[s:t] takes the elements from s to t as a new list)
            if l1[0:k-2] == l2[0:k-2]:
                Ck_item = list_Lk_sub_1[i] | list_Lk_sub_1[j]
                # prune step: keep the candidate only if all (k-1)-subsets are frequent
                if is_apriori(Ck_item, Lk_sub_1):
                    Ck.add(Ck_item)
    return Ck

def Generate_Lk_By_Ck(data_set, Ck, min_support, support_data):
    '''Parameters: transaction database, candidate frequent k-itemsets,
    minimum support, itemset-to-support dict'''
    Lk = set()
    # record, per candidate k-itemset, the number of supporting transactions
    item_count = {}
    for t in data_set:
        for Ck_item in Ck:
            if Ck_item.issubset(t):
                if Ck_item not in item_count:
                    item_count[Ck_item] = 1
                else:
                    item_count[Ck_item] += 1
    data_num = float(len(data_set))
    for item in item_count:
        if (item_count[item] / data_num) >= min_support:
            Lk.add(item)
            support_data[item] = item_count[item] / data_num
    return Lk

def Generate_L(data_set, max_k, min_support):
    '''Parameters: transaction database, largest itemset size k to mine, minimum support'''
    # dict mapping each frequent itemset (key) to its support (value)
    support_data = {}
    C1 = Create_C1(data_set)
    L1 = Generate_Lk_By_Ck(data_set, C1, min_support, support_data)
    Lk_sub_1 = L1.copy()  # shallow copy of L1
    L = []
    L.append(Lk_sub_1)
    for k in range(2, max_k + 1):
        Ck = Create_Ck(Lk_sub_1, k)
        Lk = Generate_Lk_By_Ck(data_set, Ck, min_support, support_data)
        Lk_sub_1 = Lk.copy()
        L.append(Lk_sub_1)
    return L, support_data

def Generate_Rule(L, support_data, min_confidence):
    '''Parameters: all frequent itemsets, itemset-to-support dict, minimum confidence'''
    rule_list = []
    sub_set_list = []
    for i in range(len(L)):
        for frequent_set in L[i]:
            for sub_set in sub_set_list:
                if sub_set.issubset(frequent_set):
                    conf = support_data[frequent_set] / support_data[sub_set]
                    # a rule is a tuple (antecedent, consequent, confidence)
                    rule = (sub_set, frequent_set - sub_set, conf)
                    if conf >= min_confidence and rule not in rule_list:
                        rule_list.append(rule)
            sub_set_list.append(frequent_set)
    return rule_list

if __name__ == "__main__":
    data_set = load_data_set()
    L, support_data = Generate_L(data_set, 4, 0.3)  # minimum support is 30%
    rule_list = Generate_Rule(L, support_data, 0.7)
    for Lk in L:
        print("=" * 55)
        print("frequent " + str(len(list(Lk)[0])) + "-itemsets\t\tsupport")
        print("=" * 55)
        for frequent_set in Lk:
            print(frequent_set, support_data[frequent_set])
    print()
    print("Rules")
    for item in rule_list:
        print(item[0], "=>", item[1], "'s conf: ", item[2])

Frequent itemsets found:
[Figure: program output showing the frequent itemsets, their supports, and the mined rules]

2. (OPTIONAL) Implement the FP-growth algorithm in Python and mine the frequent itemsets of the same data set, with a minimum support of 30%.

V. Experiment Summary (record what was learned in this experiment, problems encountered, etc.)

Through the study and programming in this experiment, I gained a preliminary understanding of the basic principle of the Apriori algorithm: it exploits the a priori property of frequent itemsets, namely that all non-empty subsets of a frequent itemset must also be frequent. Apriori uses an iterative method known as level-wise search, in which frequent k-itemsets are used to explore the (k+1)-itemsets. First, the database is scanned to accumulate a count for each item, and the items meeting minimum support are collected to form the set of frequent 1-itemsets, denoted L1. Then L1 is used to find the frequent 2-itemsets L2, L2 to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires a complete scan of the database. The a priori property of frequent itemsets is what allows the algorithm to compress the search space.
However, my Python programming skills are still weak and need strengthening, so that I can better implement the algorithms in a programming language and output the frequent itemsets.

Reproduced from: https://www.cnblogs.com/wonker/p/11062728.html

Origin blog.csdn.net/weixin_34174132/article/details/93709062