Python Association Rules Mining Shadow Users

Python association rules mining couples, friends, girlfriends, scumbags and dogs

Public number: You Er Hut
Author: Peter
Editor: Peter

Hello everyone, my name is Peter~

This article explains the application of an algorithm in machine learning: association rule analysis

The whole story starts with a campus card. I believe that all of you have used the campus card. It is a campus information integration and management system that integrates personal identity authentication, campus consumption, data sharing and other functions. It stores a large amount of data, including: student consumption, dormitory access control, library access, etc.

This article uses the consumption details of the student card of a university in Nanjing from April 1 to 20, 2019. From statistical visualization analysis and association rule analysis , it is found that the use of the student card and the couples, friends, girlfriends and scumbags among the students Interesting information on men and singles.

The address of the dataset used is as follows: github.com/Nicole456/A…

Import Data

import pandas as pd
import numpy as np
import datetime 
import plotly_express as px
import plotly.graph_objects as go
复制代码

1. Data 1: Basic information of each student's campus card

2. Data 2: Detailed data of each consumption and recharge of the campus card

3. Data 3: detailed data of access control

data size

In [8]:

print("df1: ", df1.shape)
print("df2: ", df2.shape)
print("df3: ", df3.shape)
df1:  (4341, 5)
df2:  (519367, 14)
df3:  (43156, 6)
复制代码

missing value

 # 每列缺失值
df1.isnull().sum() 
# 每列的缺失值占比
df2.apply(lambda x : sum(x.isnull())/len(x), axis=0)
复制代码

Comparison of the number of people

Number of different genders

Number of different professionals

In [16]:

df5 = df1["Major"].value_counts().reset_index()

df5.columns = ["Major","Number"]
df5.head()
复制代码

Different professions and genders

In [18]:

df6 = df1.groupby(["Major","Sex"])["CardNo"].count().reset_index()
df6.head()
复制代码

fig = px.treemap(
    df6,
    path=[px.Constant("all"),"Major","Sex"],  # 重点:传递数据路径
    values="CardNo",
    color="Major"   # 指定颜色变化的参数
)

fig.update_traces(root_color="maroon")
# fig.update_traces(textposition="top right")
fig.update_layout(margin=dict(t=30,l=20,r=25,b=30))

fig.show()
复制代码

Entry and exit information

Address information

In [21]:

# 1、处理address

address = df3["Address"].str.extract(r"(?P<Address_New>[\w]+)\[(?P<Out_In>[\w]+)\]")
address
复制代码

Entry and exit time

In [25]:

df8 = pd.merge(df3,df1,on="AccessCardNo")
df8.loc[:,'Date'] = pd.to_datetime(df8.loc[:,'Date'],format='%Y/%m/%d %H:%M',errors='coerce')

df8["Hour"] = df8["Date"].dt.hour
# df8["Minute"] = df8["Date"].dt.minute

# 进出门禁人数统计/小时
df9 = df8.groupby(["Hour","Out_In"]).agg({"AccessCardNo":"count"}).reset_index()
df9.head()
复制代码

# 准备画布
fig = go.Figure()

# 添加不同的数据
fig.add_trace(go.Scatter(  
    x=df9.query("Out_In == '出门'")["Hour"].tolist(),
    y=df9.query("Out_In == '出门'")["AccessCardNo"].tolist(),
    mode='lines + markers', # mode模式选择
    name='出门')) # 名字

fig.add_trace(go.Scatter(  
    x=df9.query("Out_In == '进门'")["Hour"].tolist(),
    y=df9.query("Out_In == '进门'")["AccessCardNo"].tolist(),
    mode='lines + markers', 
    name='进门')) 

fig.show()
复制代码

consumer information

In [30]:

# 数据合并  只取出两个字段:卡号和性别

df10 = pd.merge(df2,df1[["CardNo","Sex"]],on="CardNo")
复制代码

Merge information

In [32]:

df10["Card_Sex"] = df10["CardNo"].apply(lambda x: str(x)) + "_" + df10["Sex"]
复制代码

main location

In [33]:

# Card_Sex:统计消费人次
# Money:统计消费金额

df11 = (df10.groupby("Dept").agg({"Card_Sex":"count","Money":sum})
        .reset_index().sort_values("Money",ascending=False))

df11.head(10)
复制代码

fig = px.bar(df11,x="Dept",y="Card_Sex")
fig.update_layout(title_text='不同地方的消费人数',xaxis_tickangle=45) 

fig.show()
复制代码

fig = px.bar(df11,x="Dept",y="Money")
fig.update_layout(title_text='不同地方的消费金额',xaxis_tickangle=45) 

fig.show()
复制代码

Association Rules Mining

time processing

Time processing is mainly two points:

  • time format conversion
  • Time discretization: one type every 5 minutes

Here we default: if two times are in the same type, consider two people to consume together

import datetime

def change_time(x):
    # 转成标准时间格式
    result = str(datetime.datetime.strptime(x, "%Y/%m/%d %H:%M"))
    return result

def time_five(x):
    # ‘2022-02-24 15:46:09’ ---> '2022-02-24 15_9'
    res1 = x.split(":")[0]
    res2 = str(round(int(x.split(":")[1]) / 5))
    return res1 + "_" + res2
  
  
df10["New_Date"] = df10["Date"].apply(change_time)
df10["New_Date"] = df10["New_Date"].apply(time_five)
df10.head(3)
复制代码

Mention person information for each time type:

# 方式1

df11 = df10.groupby(["New_Date"])["Card_Sex"].apply(list).reset_index()
# 每个列表中的元素去重
df11["Card_Sex"] = df11["Card_Sex"].apply(lambda x: list(set(x)))
all_list = df11["Card_Sex"].tolist()

# 方式2
# all_list = []
# for i in df10["New_Date"].unique().tolist():
#     lst = df10[df10["New_Date"] == i]["Card_Sex"].unique().tolist()
#     all_list.append(lst)
复制代码

frequent itemset search

In [44]:

import efficient_apriori as ea

# itemsets:频繁项  rules:关联规则
itemsets, rules = ea.apriori(all_list,
                min_support=0.005,  
                min_confidence=1
               )
复制代码

A person

One person consumes the most data: 2565 pieces of data, after all, there are many singles!

len(itemsets[1])  # 2565条

# 部分数据
{('181539_男',): 52,
 ('180308_女',): 47,
 ('183262_女',): 100,
 ('182958_男',): 88,
 ('180061_女',): 83,
 ('182936_男',): 80,
 ('182931_男',): 87,
 ('182335_女',): 60,
 ('182493_女',): 75,
 ('181944_女',): 67,
 ('181058_男',): 93,
 ('183391_女',): 63,
 ('180313_女',): 82,
 ('184275_男',): 69,
 ('181322_女',): 104,
 ('182391_女',): 57,
 ('184153_女',): 31,
 ('182711_女',): 40,
 ('181594_女',): 36,
 ('180193_女',): 84,
 ('184263_男',): 61,
复制代码

Two people

len(itemsets[2])  # 378条
复制代码

Looking at all the data, the following results were counted:

('180433_男', '180499_女'): 34
# 可疑渣男1    
('180624_男', '181013_女'): 36,
('180624_男', '181042_女'): 37,
# 可疑渣男2
('181461_男', '180780_女'): 38,    
('181461_男', '180856_女'): 34,
    
('181597_男', '183847_女'): 44,
    
('181699_男', '181712_女'): 31,
    
('181889_男', '180142_女'): 33,
# 可疑渣男3:NB
('182239_男', '182304_女'): 39,
('182239_男', '182329_女'): 40,
('182239_男', '182340_女'): 37,
('182239_男', '182403_女'): 35,
    
('182873_男', '182191_女'): 31,

('183343_男', '183980_女'): 44,
复制代码

1. Suspicious boy 1-180624

Go back to the original data and check the intersection of time consumption between him and different girls.

(1) Intersection with girl 181013:

  • April 1st at 7.36am: We should have had breakfast together; 11:54am had lunch together
  • The intersection of different time points such as 4.10 and 4.12

(2) Intersection with girl 181042:

2. Look at the suspicious scumbag 3

This dude is really amazing~ Data mining shows that there is a certain relationship with 4 girls at the same time!

('182239_男', '182304_女'): 39
('182239_男', '182329_女'): 40
('182239_男', '182340_女'): 37
('182239_男', '182403_女'): 35
复制代码

In addition to the possible relationship between boyfriend and girlfriend, there are more basic friends or girlfriends in the 2 metadata:

('180450_女', '180484_女'): 35,
('180457_女', '180493_女'): 31,
('180460_女', '180496_女'): 31,
('180493_女', '180500_女'): 47,
('180504_女', '180505_女'): 43,
('180505_女', '180506_女'): 35,
('180511_女', '181847_女'): 42,
('180523_男', '182415_男'): 34,
('180526_男', '180531_男'): 33,
('180545_女', '180578_女'): 41,
('180545_女', '180615_女'): 47,
('180551_女', '180614_女'): 31,
('180555_女', '180558_女'): 36,
('180572_女', '180589_女'): 31,
('181069_男', '181103_男'): 44,
('181091_男', '181103_男'): 33,
('181099_男', '181102_男'): 31,
('181099_男', '181107_男'): 34,
('181102_男', '181107_男'): 35,
('181112_男', '181117_男'): 43,
('181133_男', '181136_男'): 52,
('181133_男', '181571_男'): 45,
('181133_男', '181582_男'): 33,
复制代码

3-4 people

The data of 3-4 yuan may be with classmates or friends in a dormitory, and the relative amount will be relatively small:

len(itemsets[3])  # 18条

{('180363_女', '181876_女', '183979_女'): 40,
 ('180711_女', '180732_女', '180738_女'): 35,
 ('180792_女', '180822_女', '180849_女'): 35,
 ('181338_男', '181343_男', '181344_男'): 40,
 ('181503_男', '181507_男', '181508_男'): 33,
 ('181552_男', '181571_男', '181582_男'): 39,
 ('181556_男', '181559_男', '181568_男'): 35,
 ('181848_女', '181865_女', '181871_女'): 35,
 ('182304_女', '182329_女', '182340_女'): 36,
 ('182304_女', '182329_女', '182403_女'): 32,
 ('183305_女', '183308_女', '183317_女'): 32,
 ('183419_女', '183420_女', '183422_女'): 49,
 ('183419_女', '183420_女', '183424_女'): 45,
 ('183419_女', '183422_女', '183424_女'): 48,
 ('183420_女', '183422_女', '183424_女'): 51,
 ('183641_女', '183688_女', '183690_女'): 32,
 ('183671_女', '183701_女', '183742_女'): 35,
 ('183713_女', '183726_女', '183737_女'): 36}
复制代码

4 There is only one metadata:

Summarize

Association rule analysis is a classic data mining algorithm, which is widely used in consumer detail data, supermarket shopping basket data, financial insurance, credit cards and other fields.

When we use association analysis technology to mine frequently occurring combinations and strong association rules, we can specify corresponding marketing strategies or find the relationship between different objects.

The above data mining process actually has certain defects:

  • The constraints are too wide: only group statistics based on the time interval type, ignoring the students' majors, consumption locations and other information
  • Time is too narrow: 5 minutes interval is too narrow and will filter out a lot of information

Guess you like

Origin juejin.im/post/7079678313089728519