Python: implementing the Apriori algorithm with pandas, a data analysis (table processing) tool

pandas is a tool built on top of NumPy. The name sounds cute, but it actually comes from "panel data", an econometrics term for multidimensional datasets. pandas incorporates a large number of library functions and some standard data models, and provides the tools needed to manipulate large datasets efficiently. It is mainly used for processing large data sets; processing speed is its biggest strength. Informally, you can think of it as a Python version of Excel.

API documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

Data structures:

Series: a one-dimensional array, very similar to NumPy's array. In fact, Series is built on top of NumPy's array object. Unlike a NumPy array, a Series can attach custom labels to its data, i.e. an index, and elements can then be accessed through that index (see the sketch after this list).

DataFrame: a two-dimensional tabular data structure. Many of its functions are similar to data.frame in R. A DataFrame can be understood as a container of Series.

Panel: a three-dimensional array, which can be understood as a container of DataFrames. (Note that Panel was deprecated and has been removed in recent pandas releases; a MultiIndex DataFrame serves the same purpose.)
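
As a quick illustration of Series and DataFrame, here is a minimal sketch; the data is made up for this example:

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])  # a Series with a custom index
print(s['b'])  # access by label -> 20

df = pd.DataFrame({'type': ['科技', '生活'], 'views': [100, 200]})  # each column is a Series
print(df['type'])  # selecting a column returns a Series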

For example, a small script that reads, edits, and writes a CSV file:

import pandas as pd

datafile = './opt/list.csv'
outfile = './opt/output.csv'
df = pd.read_csv(datafile, encoding='utf-8')    # load the CSV (read_csv already returns a DataFrame)
print(df.shape)                                 # print the dimensions (rows, columns)
print(df.head())                                # print the first rows (5 by default)
df.pop('video')                                 # drop the 'video' column
print(df['type'])                               # print the 'type' column
df['type'] = df['type'].replace('科技', '科学')  # replace values in the column ('technology' -> 'science')
df.to_csv(outfile, encoding='utf-8')            # write the result to a new CSV

list.csv: (screenshot of the input file not reproduced here)

output.csv: (screenshot of the output file not reproduced here)
The Apriori algorithm is introduced below:

The Apriori algorithm is a commonly used algorithm for mining association rules from data. Its goal is to find the largest frequent k-itemsets. Here the support of an itemset is the fraction of records that contain it, and the confidence of a rule A → B is support(A ∪ B) / support(A). The algorithm proceeds iteratively: it first scans for the candidate 1-itemsets and their supports, then prunes away the 1-itemsets whose support is too low, obtaining the frequent 1-itemsets. The remaining frequent 1-itemsets are joined to form the candidate 2-itemsets, those below the support threshold are filtered out to obtain the true frequent 2-itemsets, and so on. The iteration stops when no frequent (k+1)-itemset can be found, at which point the set of frequent k-itemsets is the algorithm's output.

Algorithm flow:

1) Scan the entire dataset and take every item that appears as a candidate frequent 1-itemset. Set k = 1; the frequent 0-itemset is the empty set.

2) Mine the frequent k-itemsets (a worked sketch follows this list):

    a) Scan the data and compute the support of each candidate frequent k-itemset.

    b) Remove the candidates whose support is below the threshold, obtaining the frequent k-itemsets. If the resulting set is empty, return the set of frequent (k-1)-itemsets as the result and stop. If it contains only one itemset, return that set of frequent k-itemsets as the result and stop.

    c) Join the frequent k-itemsets to generate the candidate frequent (k+1)-itemsets.

3) Set k = k + 1 and go to step 2.
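
To make the flow concrete, here is a minimal self-contained sketch on made-up toy transactions. It is plain Python, independent of the pandas-based reference code below, and it performs steps 2a-2c with a simple union-based join rather than the prefix join used later:

# Toy transactions, invented for illustration.
transactions = [{'A', 'B', 'C'}, {'A', 'B'}, {'A', 'C'}, {'B', 'C'}, {'A', 'B', 'C'}]
min_support = 0.6  # support threshold, as a fraction of all transactions

def frequent_itemsets(transactions, min_support):
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    k, current, result = 1, [frozenset([i]) for i in items], {}
    while current:
        # steps 2a/2b: count supports and prune candidates below the threshold
        supports = {c: sum(c <= t for t in transactions) / n for c in set(current)}
        frequent = {c: s for c, s in supports.items() if s >= min_support}
        if not frequent:
            break
        result.update(frequent)
        # step 2c: join frequent k-itemsets into candidate (k+1)-itemsets
        keys = list(frequent)
        current = [a | b for i, a in enumerate(keys) for b in keys[i+1:]
                   if len(a | b) == k + 1]
        k += 1
    return result

print(frequent_itemsets(transactions, min_support))
# expected: each single item with support 0.8 and each pair with support 0.6;
# {A, B, C} has support 0.4 < 0.6, so the search stops at k = 2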

Reference code: https://spaces.ac.cn/archives/3380


# -*- coding: utf-8 -*-
import time
import pandas as pd

d = pd.read_csv('./opt/test.csv', header=None, dtype=object)

print('\nConverting the raw data to a 0-1 matrix...')
start = time.perf_counter()  # time.clock() was removed in Python 3.8
ct = lambda x: pd.Series(1, index=[i for i in x if pd.notna(i)])  # one transaction -> Series of 1s indexed by item (empty cells dropped)
b = map(ct, d.values)
d = pd.DataFrame(list(b)).fillna(0)  # rows: transactions, columns: items
d = (d == 1)
end = time.perf_counter()
print('\nConversion finished, took %0.2f seconds.' % (end - start))
print('\nSearching for association rules...')
del b

support = 0.06      # minimum support
confidence = 0.75   # minimum confidence
ms = '--'           # separator used to join items, e.g. A--B; must not appear in the raw data

# Join function: builds the candidate set C_k from L_{k-1}
def connect_string(x, ms):
    x = list(map(lambda i: sorted(i.split(ms)), x))
    l = len(x[0])
    r = []
    for i in range(len(x)):
        for j in range(i, len(x)):
            # two (k-1)-itemsets join only if they agree on all but the last item
            if x[i][:l-1] == x[j][:l-1] and x[i][l-1] != x[j][l-1]:
                r.append(x[i][:l-1] + sorted([x[j][l-1], x[i][l-1]]))
    return r

# Search for association rules
def find_rule(d, support, confidence):
    start = time.perf_counter()
    result = pd.DataFrame(index=['support', 'confidence'])  # holds the output

    support_series = 1.0 * d.sum() / len(d)  # support of each single item
    column = list(support_series[support_series > support].index)  # first pruning by support
    k = 0

    while len(column) > 1:
        k = k + 1
        print('\nSearch round %s...' % k)
        column = connect_string(column, ms)
        print('Number of candidates: %s...' % len(column))
        sf = lambda i: d[i].prod(axis=1, numeric_only=True)  # support counts for the new candidates

        # Build the joined data; this step costs the most time and memory.
        # For large datasets, consider parallelising it.
        d_2 = pd.DataFrame(list(map(sf, column)), index=[ms.join(i) for i in column]).T

        support_series_2 = 1.0 * d_2[[ms.join(i) for i in column]].sum() / len(d)  # support after joining
        column = list(support_series_2[support_series_2 > support].index)  # prune by support again
        support_series = pd.concat([support_series, support_series_2])  # Series.append was removed in pandas 2.0
        column2 = []

        for i in column:  # enumerate candidate rules: is {A,B,C} really A+B->C, B+C->A, or C+A->B?
            i = i.split(ms)
            for j in range(len(i)):
                column2.append(i[:j] + i[j+1:] + i[j:j+1])

        confidence_series = pd.Series(index=[ms.join(i) for i in column2], dtype=float)  # confidence of each rule

        for i in column2:  # compute the confidences
            confidence_series[ms.join(i)] = support_series[ms.join(sorted(i))] / support_series[ms.join(i[:len(i)-1])]

        for i in confidence_series[confidence_series > confidence].index:  # prune by confidence
            result.loc['confidence', i] = confidence_series[i]
            result.loc['support', i] = support_series[ms.join(sorted(i.split(ms)))]

    result = result.T.sort_values(['confidence', 'support'], ascending=False)  # tidy up and print
    end = time.perf_counter()
    print('\nSearch finished, took %0.2f seconds.' % (end - start))
    print('\nResult:')
    print(result)

    return result

find_rule(d, support, confidence).to_csv('./opt/out.csv')
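
For clarity, here is what the 0-1 conversion at the top of the script produces; the rows are made up for illustration:

import pandas as pd

rows = [['A', 'B', 'C'], ['A', 'B', None], ['B', 'C', None]]  # three toy transactions
d = pd.DataFrame(rows, dtype=object)

ct = lambda x: pd.Series(1, index=[i for i in x if pd.notna(i)])
m = pd.DataFrame(list(map(ct, d.values))).fillna(0) == 1
print(m)
#        A     B      C
# 0   True  True   True
# 1   True  True  False
# 2  False  True   True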

Test text: https://spaces.ac.cn/usr/uploads/2015/07/3424358296.txt

Recommended reference blogs:

https://www.jianshu.com/p/a77b0bc736f2

https://www.cnblogs.com/pinard/p/6293298.html

https://spaces.ac.cn/archives/3380
