Shopping Basket Analysis and Product Positioning Analysis

What is Product Association Analysis

Association analysis discovers connections between different commodities (items) in a transaction database. It is mainly used in scenarios such as recommendation on e-commerce sites and product placement in offline retail stores.
An association rule is evaluated with three measures:

  • Support: the probability that an itemset appears in the data set. For example, if A and B appear together 50 times in 1,000 transactions, the support of the rule is 50 ÷ 1000 = 5%.
  • Confidence: the probability that B occurs given that A has already occurred. Confidence = P(A and B) ÷ P(A).
  • Lift: the ratio of the probability of buying product B given that product A was bought to the probability of buying B without that condition, i.e. confidence ÷ P(B). Generally, if the lift is less than 1, the rule should not be used for recommendation.

For example, consider the following shopping baskets, where A, B, C, D, and E represent different products:
[figure: example shopping baskets containing products A–E]

Rule        Support   Confidence   Lift
A → D       0.4       0.67         1.12
C → A       0.4       0.5          0.83
A → C       0.4       0.67         0.83
B&C → D     0.2       0.33         0.55

How are these values calculated? Take the rule A → D as an example:

  • Support: A and D appear together in 2 of the 5 baskets, so the support is 2 ÷ 5 = 0.4.
  • Confidence: A appears in 3 of the 5 baskets, so P(A) = 0.6. Confidence = P(A and D) ÷ P(A) = 0.4 ÷ 0.6 ≈ 0.67.
  • Lift: confidence divided by the unconditional probability of D (3 ÷ 5 = 0.6), i.e. 0.67 ÷ 0.6 ≈ 1.12.
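Since the original figure with the basket contents is not shown, the following is a minimal sketch of how these three measures can be computed. The five baskets are an illustrative reconstruction that reproduces the numbers in the table above; they are not taken from the original figure.

def rule_metrics(baskets, antecedent, consequent):
    # support, confidence and lift for the rule antecedent -> consequent
    n = len(baskets)
    n_ante = sum(1 for b in baskets if antecedent <= b)
    n_cons = sum(1 for b in baskets if consequent <= b)
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = n_both / n
    confidence = n_both / n_ante
    lift = confidence / (n_cons / n)
    return support, confidence, lift

# Illustrative baskets consistent with the table above (not the original figure)
baskets = [{'A', 'B', 'C', 'D'}, {'A', 'B', 'C'}, {'A', 'D'}, {'C', 'D'}, {'B', 'C'}]
print(rule_metrics(baskets, {'A'}, {'D'}))       # ≈ (0.4, 0.67, 1.11); the table's 1.12 comes from rounding the confidence first
print(rule_metrics(baskets, {'B', 'C'}, {'D'}))  # ≈ (0.2, 0.33, 0.56); the table's 0.55 again comes from rounding first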

What is Market Basket Analysis

In the industry, the set of commodities purchased by a single customer is called a shopping basket, and market basket analysis is association analysis applied to those commodities. Because this kind of association analysis was first widely used in supermarkets, it is also known as "shopping basket analysis". The classic algorithm for market basket analysis is the Apriori algorithm.
Market basket analysis aims to:

  • Find the right combinations of items
  • Find the purchase times that correspond to different combinations
  • Find the purchase order that corresponds to different combinations

For example, suppose there are the following transaction orders:
[figure: sample transaction orders]
Step 1: Calculate the transaction frequency of different commodities
[figure: transaction frequency of each commodity]
Step 2: Filter commodities according to the minimum support (assuming the minimum support is 50%)
[figure: commodities that meet the minimum support]
Step 3: Combine the remaining commodities in pairs and calculate the transaction frequency of each combination.
The number of possible combinations is given by the combination formula:

C(n, r) = n! / (r! × (n − r)!)

where n is the number of commodities and r is the number of commodities in a combination.
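For instance, the number of two-commodity combinations that can be formed from four remaining commodities can be computed directly (a minimal sketch; the numbers are illustrative):

from math import comb

print(comb(4, 2))  # 6 two-commodity combinations from 4 commodities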

[figure: transaction frequency of two-commodity combinations]
Step 4: Filter combinations according to the minimum support (minimum support is 50%)

[figure: two-commodity combinations that meet the minimum support]
Step 5: Calculate the transaction frequency of the three-commodity combinations
insert image description here
Step 6: Filter the combinations according to the minimum support (minimum support is 50%)
[figure: three-commodity combinations that meet the minimum support]
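The intermediate tables above are only shown as figures, so the sketch below walks through the same enumerate-and-filter procedure on a small set of made-up orders; the order contents and the 50% minimum support are assumptions for illustration. Note that the actual Apriori algorithm builds candidate r-item combinations only from the frequent (r − 1)-item combinations, rather than enumerating all of them as this sketch does.

from itertools import combinations

# Illustrative transaction orders (not the original figure)
orders = [{'milk', 'bread', 'beer'},
          {'milk', 'bread', 'beer'},
          {'bread', 'beer'},
          {'milk', 'diapers'}]
min_support = 0.5

def frequent_combinations(orders, r, min_support):
    # enumerate every r-commodity combination and keep those whose
    # transaction frequency (support) reaches the minimum support
    items = sorted(set().union(*orders))
    result = {}
    for combo in combinations(items, r):
        freq = sum(1 for order in orders if set(combo) <= order) / len(orders)
        if freq >= min_support:
            result[combo] = freq
    return result

print(frequent_combinations(orders, 1, min_support))  # single commodities (Steps 1-2)
print(frequent_combinations(orders, 2, min_support))  # two-commodity combinations (Steps 3-4)
print(frequent_combinations(orders, 3, min_support))  # three-commodity combinations (Steps 5-6)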

Applications of Market Basket Analysis

Assume that there are the following orders, each consisting of a shopping basket id and the products sold.
[figure: order data with basket id and products]
The best-selling products are as follows:
[figure: best-selling products]
We apply the Apriori algorithm to this data. First, we preprocess the data by merging the products that belong to the same order (id).

import pandas as pd
inputfile = '/content/GoodsOrder.csv'
data = pd.read_csv(inputfile, encoding='gbk')

# Merge the "Goods" column by id, separating the individual products with ","
data['Goods'] = data['Goods'].apply(lambda x: ',' + x)
data = data.groupby('id').sum().reset_index()

# Convert the merged goods column into list format
data['Goods'] = data['Goods'].apply(lambda x: [x[1:]])
data_list = list(data['Goods'])

# Split the product names into separate elements
data_translation = []
for i in data_list:
    p = i[0].split(',')
    data_translation.append(p)

This produces data in the following format:
[figure: preprocessed transaction lists, one list of products per order]
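Since that figure is not shown, the structure below is a sketch of what data_translation looks like; the product names are placeholders rather than entries from the actual GoodsOrder.csv file.

# Illustrative structure (placeholder product names, not the real dataset):
example_translation = [
    ['product A', 'product B', 'product C'],  # all goods in the first order id
    ['product B', 'product D'],               # all goods in the second order id
]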

 
def loadDataSet():
    return [['a', 'c', 'e'], ['b', 'd'], ['b', 'c'], ['a', 'b', 'c', 'd'], ['a', 'b'], ['b', 'c'], ['a', 'b'],
            ['a', 'b', 'c', 'e'], ['a', 'b', 'c'], ['a', 'c', 'e']]
 
def createC1(dataSet):
    # Build the list of candidate 1-itemsets
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if not [item] in C1:
                C1.append([item])
    C1.sort()
    # Map to frozenset so the itemsets are hashable and can be used as dict keys
    return list(map(frozenset, C1))
    
# From candidate k-itemsets to frequent k-itemsets (support calculation)
def scanD(D, Ck, minSupport):
    ssCnt = {}
    for tid in D:  # iterate over the transactions
        for can in Ck:  # iterate over the candidate itemsets
            if can.issubset(tid):  # check whether the candidate is contained in this transaction
                if not can in ssCnt:
                    ssCnt[can] = 1  # first occurrence: set the count to 1
                else:
                    ssCnt[can] += 1  # otherwise increment the count
    numItems = float(len(D))  # number of transactions
    retList = []  # the frequent itemsets Lk
    supportData = {}  # support of each candidate itemset
    for key in ssCnt:
        support = ssCnt[key] / numItems  # compute the support
        if support >= minSupport:
            retList.insert(0, key)  # keep candidates that meet the minimum support
            supportData[key] = support
    return retList, supportData
 
def calSupport(D, Ck, min_support):
    # Support calculation for the candidate 1-itemsets
    dict_sup = {}
    for i in D:
        for j in Ck:
            if j.issubset(i):
                if not j in dict_sup:
                    dict_sup[j] = 1
                else:
                    dict_sup[j] += 1
    sumCount = float(len(D))
    supportData = {}
    relist = []
    for i in dict_sup:
        temp_sup = dict_sup[i] / sumCount
        if temp_sup >= min_support:
            relist.append(i)
            # here only the support of the frequent itemsets is recorded;
            # this could be changed to return the support of all candidates
            supportData[i] = temp_sup
    return relist, supportData
 
# Candidate generation with improved pruning
def aprioriGen(Lk, k):
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):  # iterate over all pairs of frequent (k-1)-itemsets
            L1 = list(Lk[i])[:k - 2]
            L2 = list(Lk[j])[:k - 2]
            L1.sort()
            L2.sort()
            if L1 == L2:  # if the first k-2 items are equal, the two sets can be merged; this prevents duplicate candidates
                # pruning: a is a candidate k-itemset, b holds all of its (k-1)-item subsets
                a = Lk[i] | Lk[j]  # a is a frozenset
                a1 = list(a)
                b = []
                # remove each element in turn and collect the resulting (k-1)-item subsets in b
                for q in range(len(a1)):
                    t = [a1[q]]
                    tt = frozenset(set(a1) - set(t))
                    b.append(tt)
                t = 0
                for w in b:
                    # keep the candidate only if every (k-1)-item subset is frequent (i.e. in Lk)
                    if w in Lk:
                        t += 1
                if t == len(b):
                    retList.append(b[0] | b[1])
    return retList

def apriori(dataSet, minSupport=0.2):
    # The first three statements compute the frequent 1-itemsets
    C1 = createC1(dataSet)
    D = list(map(set, dataSet))  # convert each transaction to a set
    L1, supportData = calSupport(D, C1, minSupport)
    L = [L1]  # wrap in a list so the 1-itemsets are a single element of L
    k = 2
    while (len(L[k - 2]) > 0):  # while there are still frequent itemsets to extend
        Ck = aprioriGen(L[k - 2], k)
        Lk, supK = scanD(D, Ck, minSupport)  # scan the data set to get Lk
        supportData.update(supK)  # add the key/value pairs from supK to supportData
        L.append(Lk)  # the last element of L ends up being an empty list
        k += 1
    del L[-1]  # remove the final empty list
    return L, supportData  # L is the list of frequent itemsets, one element per itemset size

# Generate all non-empty proper subsets of an itemset
def getSubset(fromList, toList):
    for i in range(len(fromList)):
        t = [fromList[i]]
        tt = frozenset(set(fromList) - set(t))
        if not tt in toList:
            toList.append(tt)
            tt = list(tt)
            if len(tt) > 1:
                getSubset(tt, toList)
 
def calcConf(freqSet, H, supportData, ruleList, minConf=0.7):
    for conseq in H:  # iterate over the candidate consequents and compute their confidence
        conf = supportData[freqSet] / supportData[freqSet - conseq]  # confidence, from the stored support values
        # lift = P(A and B) / (P(A) * P(B))
        lift = supportData[freqSet] / (supportData[conseq] * supportData[freqSet - conseq])

        if conf >= minConf and lift > 1:
            print(freqSet - conseq, '-->', conseq, 'support', round(supportData[freqSet], 6), 'confidence:', round(conf, 6),
                  'lift:', round(lift, 6))
            ruleList.append((freqSet - conseq, conseq, conf))
 
# Generate the association rules
def gen_rule(L, supportData, minConf=0.7):
    bigRuleList = []
    for i in range(1, len(L)):  # start from the 2-itemsets
        for freqSet in L[i]:  # freqSet iterates over all frequent k-itemsets
            # compute all non-empty proper subsets of freqSet (1-itemsets up to
            # (k-1)-itemsets) as a list of frozensets
            H1 = list(freqSet)
            all_subset = []
            getSubset(H1, all_subset)  # generate all the subsets
            calcConf(freqSet, all_subset, supportData, bigRuleList, minConf)
    return bigRuleList
 
if __name__ == '__main__':
    dataSet = data_translation
    L, supportData = apriori(dataSet, minSupport = 0.02)
    rule = gen_rule(L, supportData, minConf = 0.35)

[figure: association rules output by the code]
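The rule list returned by gen_rule can also be inspected directly; a minimal sketch, assuming the variables defined in the script above:

# each rule is stored as (antecedent, consequent, confidence)
for antecedent, consequent, conf in rule:
    print(set(antecedent), '-->', set(consequent), 'confidence:', round(conf, 4))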

Product Positioning Analysis

[figures: product positioning analysis]

Origin blog.csdn.net/weixin_44820355/article/details/119066013