Big Data: A Case Study of Association Rule Mining

Environment: Hive on a virtual machine + local Spark + Python (PySpark)

Data: commodity order data + commodity category data

Steps: after uploading the data to HDFS, create the Hive tables, process the data, mine association rules, and visualize the results in Python

Implemented functionality: mine the information in the commodity orders to obtain the association relationships between commodity combinations (this article only works with the orders, not the category types)

1. Data preparation

Upload the GoodsOrder.csv and GoodsTypes.csv files to the data folders created on HDFS:

hdfs dfs -mkdir -p /hive_data/homework/data/order
hdfs dfs -mkdir -p /hive_data/homework/data/type
hdfs dfs -put /export/data/hive_data/GoodsOrder.csv /hive_data/homework/data/order
hdfs dfs -put /export/data/hive_data/GoodsTypes.csv /hive_data/homework/data/type
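An optional sanity check: the uploaded files can be read back directly with PySpark. This is only a minimal sketch, assuming the same NameNode address (hdfs://192.168.121.131:9000) that the table definitions below use:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("check upload").getOrCreate()
# Read the raw CSV straight from HDFS to confirm the upload succeeded
spark.read.csv("hdfs://192.168.121.131:9000/hive_data/homework/data/order").show(5)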


2. PySpark creates the Hive tables

This assumes that connecting local Spark to the virtual machine's Hive has already been completed; see the previous article.

There is a small pitfall here. If Spark drops a Hive table directly with spark.sql("drop .."), it only deletes the table's metadata, not the table's data files. Recreating a Hive table with the same name afterwards will then keep complaining that a certain folder already exists. After checking, the table can either be deleted directly in Hive, or deleted from Spark with the commands below (setting 'external.table.purge'='true' makes the drop also remove the data files):

spark.sql("ALTER TABLE homework.type SET TBLPROPERTIES ('external.table.purge'='true')")
spark.sql("drop table homework.type")

Create the Hive tables -- order and type:

from pyspark.sql import SparkSession


def createTable():
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL Hive integration example") \
        .enableHiveSupport() \
        .getOrCreate()
    spark.sql("show databases").show()
    # If no location is specified when creating a database, it is created by default
    # in the folder set by hive.metastore.warehouse.dir, configured in hive-site.xml
    spark.sql("create database IF NOT EXISTS homework")
    '''
    The data files are uploaded from the virtual machine to the /hive_data/homework/data
    folder on HDFS with the hdfs command, e.g.
    hdfs dfs -put /export/data/hive_data/download.csv /hive_data/homework/data
    The access prefix is set in Hadoop's core-site.xml with the value hdfs://master,
    where master maps to node01 and node02; the concrete choice is made by Zookeeper.
    Use the status command to see which virtual machine was elected leader (currently
    node2 is the leader), so the HDFS file system can be reached through the address
    below (port 9000).
    Note that the column types in the CREATE TABLE statement must match the imported
    data, and that the field delimiter must be set, otherwise every imported value
    will be NULL.
    '''
    spark.sql("use homework")
    spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS homework.order(id INT,good STRING) "
              "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION "
              "'hdfs://192.168.121.131:9000/hive_data/homework/data/order'")
    spark.sql("select * from homework.order").show()

    spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS homework.type(good STRING,type STRING) "
              "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION "
              "'hdfs://192.168.121.131:9000/hive_data/homework/data/type'")
    spark.sql("show tables").show()
    spark.sql("select * from homework.type").show()

PS: If the SELECT results come back garbled, the CSV files were most likely not saved as UTF-8. After converting the encoding, delete the files on HDFS, re-upload the corrected files, drop the Hive tables, and import again; this resolves the garbled characters.

The tables were created successfully.


3. PySpark data processing

The association rule algorithm used is Apriori, implemented here through the mlxtend library because it computes the support and related metrics in detail. The data in the order table therefore first has to be converted into the list-of-transactions format that mlxtend requires.

from pyspark.sql import functions as F

# 'order' below is the DataFrame of the homework.order table,
# e.g. order = spark.table("homework.order")
# Group by order id; Goods_list stores the goods in each order
order_list = order.groupBy("id").agg(F.collect_list("good").alias("Goods_list"))
order_list.show()

# Extract the contents of the Goods_list column into the good_list Python list
Goods_list = order_list.select("Goods_list").collect()
good_list = []
for i in Goods_list:
    good_list.append(i["Goods_list"])
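The same extraction can be written more compactly as a list comprehension, producing an identical good_list:

# Equivalent one-liner: one inner list of goods per order
good_list = [row["Goods_list"] for row in order_list.select("Goods_list").collect()]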


4. Association rule mining

Here we study the association rules between the items in the orders, that is, the probability that a user purchases another item when purchasing a certain item. Suitable results can be obtained by tuning the minimum support and confidence. A lift greater than 1 indicates a positive correlation between the antecedent and the consequent, and a lift less than 1 indicates a negative correlation.
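For reference, with X the antecedent itemset and Y the consequent, the three metrics used below are defined as:

support(X→Y) = P(X∪Y)
confidence(X→Y) = P(X∪Y) / P(X)
lift(X→Y) = confidence(X→Y) / P(Y)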

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

'''
2. Association rule mining
'''
TE = TransactionEncoder()  # build the transaction encoder
data = TE.fit_transform(good_list)  # convert the raw data into boolean form
df = pd.DataFrame(data, columns=TE.columns_)  # store the boolean data in a DataFrame

item_set = apriori(df, min_support=0.05, use_colnames=True)  # set the minimum support parameter
print(item_set)

# Note: "Empty DataFrame" probably means no frequent itemsets exceed the threshold,
# so item_set is empty and the association_rules call below errors out
rules = association_rules(item_set, min_threshold=0.2)  # set the minimum confidence; strong rules are generated from the frequent itemsets
pd.set_option('expand_frame_repr', False)
print(rules)

data_a = []  # x-axis labels of the bar chart, recording each association rule
data_b = []  # first bar of each rule: confidence
data_c = []  # second bar of each rule: lift
for i, j in rules.iterrows():  # in 'for index, row in df.iterrows():', index is the row index and row is the row content
    X = j['antecedents']
    Y = j['consequents']
    Z = j['confidence']
    U = j['lift']
    x = ','.join([item for item in X])
    y = ','.join([item for item in Y])
    xy = x + "→" + y
    print(xy + " confidence=" + str(Z) + " lift=" + str(U))
    data_a.append(xy)
    data_b.append(Z)
    data_c.append(U)
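If only positively correlated rules are of interest, they can be filtered right after the association_rules call (i.e. before the loop above); a small sketch:

# Keep only rules whose lift exceeds 1 (positive correlation)
rules = rules[rules['lift'] > 1]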


5. Visualization of the association rules

import numpy as np
import matplotlib.pyplot as plt

# Draw the bar chart
plt.rcParams['font.sans-serif'] = ["SimHei"]  # Chinese is not supported by default, so set a Chinese font temporarily
size = len(data_a)
place = np.arange(size)  # bar-group midpoints, generated from the number of data points; also where the x-axis labels go
# 1. Set up the x-axis
fig = plt.figure(figsize=(15, 12), dpi=200)
ax = fig.add_subplot(1, 1, 1)
ax.set_xticks(place)  # positions of the x-axis ticks
ax.set_xticklabels(data_a, rotation=0, fontsize=10)  # x-axis tick labels
ax.set_xlabel('关联规则')  # x-axis name ("association rule")
# 2. Set up the two bars
total_width, n = 0.8, 2  # n is the number of bar series
width = total_width / n  # width of each bar
place = place - (total_width - width) / 2  # starting position of each bar group
plt.bar(place, data_b, width=width, label='置信度')  # bar 1: confidence
plt.bar(place + width, data_c, width=width, label='提升度')  # bar 2: lift
# 3. Display the values on top of the bars
for i in range(size):
    plt.text(place[i], data_b[i] + 0.01, '%.3f' % data_b[i], ha='center', va='bottom',
             fontsize=10)  # value on top of bar 1: three decimals, centered 0.01 above the bar
    plt.text(place[i] + width, data_c[i] + 0.01, '%.3f' % data_c[i], ha='center', va='bottom',
             fontsize=10)  # value on top of bar 2
plt.legend()
plt.show()
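To also keep a copy of the chart on disk, plt.savefig can be added just before plt.show() (the filename here is only an example):

plt.savefig("association_rules.png", bbox_inches="tight")  # must come before plt.show()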

6. Complete code and data files

https://files.cnblogs.com/files/blogs/745215/%E5%85%B3%E8%81%94%E8%A7%84%E5%88%99%E6%8C%96%E6%8E%98.zip?t=1671788956

7. Summary

Mining association rules can provide a reference for product placement. For example, yogurt and whole milk could be placed together; combined with the product category data, it can also provide a reference for placing different types of products.

Origin blog.csdn.net/qq_51641196/article/details/128478588