Tianchi Competition: visual analysis of Taobao user shopping behavior data

Table of contents

Preface

1. Introduction to the competition task

2. Data cleaning, feature construction, and feature visualization

1. Handling of missing data and duplicate values

2. Date separation, PV and UV construction

3. PV and UV visualization

4. User behavior visualization

4.1 Area chart of each behavior (taking UV as an example)

4.2 Heat map of each behavior

5. Conversion rate visualization

3. RFM model

1. Build R, F, M

2. Statistical distribution of RFM data

3. Calculate RFM scores and combinations

4. RFM combination column chart and score pie chart visualization

5. RFM 3D column chart display

4. Product type correlation analysis

4.1. Extract association rules

4.2. Product association rules relationship diagram

4.3. Product word cloud chart

Summary


Preface

The competition data set contains more than 10 million records and 4 features. I mainly analyze and visualize it from two angles: RFM customer segmentation and product correlation analysis. Besides basic Matplotlib charts, I also use pyecharts for visualization, which is good practice for anyone who likes processing large data sets with Python. Reference baseline for this competition: Taobao user shopping behavior data visualization analysis baseline (Tianchi notebook, Alibaba Cloud Tianchi).


1. Introduction to the competition task

2014 was a year of rapid development for Alibaba Group's mobile e-commerce business. In the 2014 Double 11 promotion, for example, mobile transactions accounted for 42.6% of the total, exceeding 24 billion yuan. Compared with the PC era, mobile access can happen anytime and anywhere and carries richer contextual data, such as user location and the time patterns of user visits.

The purpose of this task is to analyze desensitized user behavior data (browsing, favoriting, adding to cart, and purchasing) with Python, NumPy, Pandas and Matplotlib, to help participants better understand the data and generate business insights.

The analysis data provides the complete behavioral records of 10,000 users in user_action.csv. To simplify the problem, compared with the original data set, the user_geohash field, which is mostly empty, has been removed.

Field            Description                               Extraction notes
user_id          User identity                             Sampled and desensitized
item_id          Item identity                             Desensitized
behavior_type    Type of user behavior toward the item     Browse, favorite, add to cart, purchase, coded as 1, 2, 3, 4
item_category    Item category identity                    Desensitized
time             Time of the behavior                      Accurate to the hour

Note: the data covers the mobile behavior of the 10,000 sampled users over one month (2014-11-18 to 2014-12-18). Compared with the algorithm challenge, this visual analysis task removes the user_geohash field, and for ease of computation the data is also reduced from the algorithm challenge's 1 million users to 10,000.

2. Data cleaning, feature construction, and feature visualization

1. Handling of missing data and duplicate values

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from pyecharts import options as opts
from pyecharts.charts import Bar3D,Bar,Pie,Funnel,Line
df = pd.read_csv('/user_action.csv')
df.shape
# (12256906, 5)
# over 12 million rows

df.isnull().sum() # check for missing values
# this data set has no missing values
print(df.duplicated().sum()) # count duplicate rows
df.drop_duplicates(keep='first',inplace = True) # drop duplicates
print(df.shape)
# 6043527       over 6 million duplicate rows
# (6213379, 5)  just over 6 million rows remain after deduplication

2. Date separation, PV and UV construction

df['date'] = df['time'].map(lambda x: x.split(' ')[0]) 
df['hour'] =df['time'].map(lambda x: x.split(' ')[1])
df.loc[:,'data_now']='2014-12-20'  # reference date used later to build R in the RFM model
df.head()

Page views (PV): short for Page View, counted from the number of times users load Taobao pages; every refresh or newly opened page is recorded as one view.

Unique visitors (UV): short for Unique Visitor; a user who visits Taobao multiple times is counted only once. Readers familiar with SQL will recognise this as essentially a DISTINCT count.

pv_daily = df.groupby('date')['user_id'].count()
pv_daily = pv_daily.reset_index() 
pv_daily = pv_daily.rename(columns={'user_id':'pv_daily'})

pv_hour = df.groupby('hour')['user_id'].count()
pv_hour = pv_hour.reset_index()
pv_hour = pv_hour.rename(columns={'user_id':'pv_hour'})

uv_daily = df.groupby('date')['user_id'].apply(lambda x: len(x.unique()))
uv_daily = uv_daily.reset_index()
uv_daily = uv_daily.rename(columns = {'user_id':'uv_daily'})

uv_hour = df.groupby('hour')['user_id'].apply(lambda x: len(x.unique()))
uv_hour = uv_hour.reset_index()
uv_hour = uv_hour.rename(columns={'user_id':'uv_hour'})
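As an aside, the same UV tables can be built a little more idiomatically with nunique (a sketch equivalent to the len(unique) lambdas above):

# equivalent daily and hourly UV using nunique
uv_daily = df.groupby('date')['user_id'].nunique().reset_index().rename(columns={'user_id': 'uv_daily'})
uv_hour = df.groupby('hour')['user_id'].nunique().reset_index().rename(columns={'user_id': 'uv_hour'})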

3. PV and UV visualization

import matplotlib.dates as mdates
plt.figure(figsize=(14,10))
sns.set_style('dark')

plt.subplot(2, 2, 1)
ax=sns.lineplot(x='date',y='pv_daily',data=pv_daily)
plt.xticks(rotation=45,horizontalalignment='right',fontweight='light')
locator = mdates.DayLocator(interval=3)  # show a date tick every 3 days
ax.xaxis.set_major_locator(locator) 
plt.title('pv_daily')

plt.subplot(2, 2, 2)
ax1=sns.lineplot(x='date',y='uv_daily',data=uv_daily)
plt.title('uv_daily')
plt.xticks(rotation=45,horizontalalignment='right',fontweight='light')
ax1.xaxis.set_major_locator(locator)

plt.subplot(2, 2, 3)
ax2=sns.lineplot(x='hour',y='pv_hour',data=pv_hour)
plt.title('pv_hour')
locator1 = mdates.DayLocator(interval=3)
ax2.xaxis.set_major_locator(locator1)

plt.subplot(2, 2, 4)
ax3=sns.lineplot(x='hour',y='uv_hour',data=uv_hour)
plt.title('uv_hour')
ax3.xaxis.set_major_locator(locator1)

plt.subplots_adjust(wspace=0.4,hspace=0.8)  # adjust spacing between subplots
plt.show()

PV and UV both peak on Double 12 and bottom out between 3 and 6 a.m. The Double 12 data can also be extracted on its own to check whether its traffic distribution over the day differs from other days.
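A minimal sketch of that follow-up, assuming Double 12 appears as the string '2014-12-12' in the date column built above:

# hourly PV on Double 12 only (assumes the date string '2014-12-12')
pv_hour_1212 = df[df['date'] == '2014-12-12'].groupby('hour')['user_id'].count()
pv_hour_1212 = pv_hour_1212.reset_index().rename(columns={'user_id': 'pv_hour_1212'})

plt.figure(figsize=(8, 4))
sns.lineplot(x='hour', y='pv_hour_1212', data=pv_hour_1212)
plt.title('pv_hour on Double 12')
plt.show()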

4. User behavior visualization

4.1 Area chart of each behavior (taking UV as an example)

behavior = df.groupby(['behavior_type','date'])['user_id'].apply(lambda x: len(x.unique()))
behavior = behavior.reset_index()
behavior = behavior.rename(columns = {'user_id':'uv'})

behavior1=behavior[behavior['behavior_type']==1].rename(columns = {'uv':'浏览'})
behavior2=behavior[behavior['behavior_type']==2].reset_index().rename(columns = {'uv':'收藏'})
behavior3=behavior[behavior['behavior_type']==3].reset_index().rename(columns = {'uv':'加购'})
behavior4=behavior[behavior['behavior_type']==4].reset_index().rename(columns = {'uv':'购买'})

result = pd.concat([behavior1, behavior2,behavior3,behavior4], axis=1)
result =result.loc[:,~result.columns.duplicated()] # drop columns with duplicated names, keeping the first
result = result.drop(labels=['behavior_type','index'], axis=1)
result.head()

# area chart
x = behavior1['date'].values.tolist()
y1 = behavior1['浏览'].values.tolist()
y2 = behavior2['收藏'].values.tolist()
y3 = behavior3['加购'].values.tolist()
y4 = behavior4['购买'].values.tolist()
c = (
    Line()
    .add_xaxis(x)
    .add_yaxis("浏览", y1, is_smooth=True)
    .add_yaxis("收藏", y2, is_smooth=True)
    .add_yaxis("加购", y3, is_smooth=True)
    .add_yaxis("购买", y4, is_smooth=True)
    .set_series_opts(
        areastyle_opts=opts.AreaStyleOpts(opacity=0.5),
        label_opts=opts.LabelOpts(is_show=False),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="各个行为UV面积图"),
        xaxis_opts=opts.AxisOpts(
            axistick_opts=opts.AxisTickOpts(is_align_with_label=True),
            is_scale=False,
            boundary_gap=True,
        ),
    )
)
c.render_notebook()

There is an obvious peak on Double 12. The advantage of pyecharts is that the chart is interactive: you can click a series in the legend to show only the data you want to see.

4.2 Heat map of each behavior

plt.rcParams['font.sans-serif'] = ['SimHei'] # enable Chinese characters in matplotlib labels
correlation_matrix=result.corr()
plt.figure(figsize=(8,6))
sns.heatmap(correlation_matrix,vmax=0.9,linewidths=0.05,cmap="GnBu_r",annot=True,annot_kws={'size': 15})
plt.title("uv", fontsize = 20)

Basically all behaviors are strongly correlated with each other, although the correlation between favoriting and purchasing is relatively low compared with the rest; after all, users still shop around after favoriting an item.

5. Conversion rate visualization

behavior_type = df.groupby(['behavior_type'])['user_id'].count()

click_num, fav_num, add_num, pay_num =  behavior_type[1], behavior_type[2], behavior_type[3], behavior_type[4]
fav_add_num = fav_num + add_num 
behavior_type1=pd.DataFrame([click_num, fav_add_num, pay_num],index=["浏览", "收藏+加购", "购买"],columns=["A"])
behavior_type1['B']=(100*behavior_type1['A']/click_num).round(2)  # percentage relative to the browse count
# funnel chart
x = ["浏览", "收藏+加购", "购买"]
y = behavior_type1['B'].values.tolist()
c = (
    Funnel()
    .add("",[list(z) for z in zip(x,y)])
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}占比{c}%"))
    .set_global_opts(title_opts=opts.TitleOpts(title="转化率"))
)
c.render_notebook()

The overall conversion rate from browsing to purchasing is below 2%; on Double 12 it reached 4.65%.
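A minimal sketch of how the Double 12 figure can be recomputed with the same counting method (again assuming the date string '2014-12-12'):

# browse-to-purchase conversion on Double 12: purchase count divided by page view count on that day
d12 = df[df['date'] == '2014-12-12'].groupby('behavior_type')['user_id'].count()
print('Double 12 browse -> purchase conversion: {:.2%}'.format(d12[4] / d12[1]))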

3. RFM model

Since this data set does not contain item prices, the three indicators are constructed a little differently.

R: recency, measured against a reference date set two days after the last date in the data set.

F: frequency, the number of distinct days within the data set's date range on which the customer made a purchase.

M: monetary proxy, the number of items the customer purchased.

1. Build R, F, M

df1=df[df['behavior_type']==4].copy() # keep only completed purchases (behavior_type == 4)

df1['day']=(pd.to_datetime(df1['data_now'])- pd.to_datetime(df1['date'])).apply(lambda x : x.days)
data_r = df1.groupby(['user_id'])['day'].agg('min').reset_index().rename(columns = {'day':'R'})
data_f = df1.groupby(['user_id'])['date'].apply(lambda x: len(x.unique())).reset_index().rename(columns = {'date':'F'})
data_m = df1.groupby(['user_id'])['item_id'].count().reset_index().rename(columns = {'item_id':'M'})
RFM= pd.concat([data_r,data_f,data_m], axis=1)
RFM =RFM.loc[:,~RFM.columns.duplicated()]

2. Statistical distribution of RFM data

RFM.describe().T

3. Calculate RFM scores and combinations

# define bin edges
r_bins = [0,3,9,32] # note: the lower edge must be below the minimum value
f_bins = [0,2,8,30] 
m_bins = [0,4,15,745]
# RFM bin scores
RFM['r_score'] = pd.cut(RFM['R'], r_bins, labels=[i for i in range(len(r_bins)-1,0,-1)])  # R score, labels reversed: the more recent the purchase, the higher the score
RFM['f_score'] = pd.cut(RFM['F'], f_bins, labels=[i+1 for i in range(len(f_bins)-1)])  # F score
RFM['m_score'] = pd.cut(RFM['M'], m_bins, labels=[i+1 for i in range(len(m_bins)-1)])  # M score
# method 1: total RFM score
RFM[['r_score','f_score','m_score']] = RFM[['r_score','f_score','m_score']].astype(int)
RFM['rfm_score'] = RFM['r_score']  + RFM['f_score']  + RFM['m_score'] 
# method 2: RFM group as a concatenated string
RFM=RFM.applymap(str)
RFM['rfm_group']=RFM['r_score']+RFM['f_score']+RFM['m_score']
RFM.head()

Note that R is binned in reverse order, i.e. the more recent the last purchase, the higher the score.

4. RFM combination column chart and score pie chart visualization

# bar chart
RFM_new = RFM.groupby(['rfm_group','rfm_score'])['user_id'].count().reset_index().rename(columns = {'user_id':'number'})
RFM_new = RFM_new.rename_axis('index').reset_index()
l1=RFM_new['rfm_group'].values.tolist()
l2=RFM_new['number'].values.tolist()
from pyecharts.globals import ThemeType
c = Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK)) # dark background theme
c.add_xaxis(l1)
c.add_yaxis("类别数量", l2)
c.set_global_opts(title_opts=opts.TitleOpts(title="RFM类别数量"),
                  yaxis_opts=opts.AxisOpts(name="数量"),
                  xaxis_opts=opts.AxisOpts(name="组别"))
c.render_notebook()

# pie chart
RFM_score=RFM_new.groupby(['rfm_score'])['number'].sum().reset_index()   # share of each total-score group
RFM_score['score_pt']=(RFM_score['number']/RFM_score['number'].sum()).round(2)
x_data = RFM_score['rfm_score'].values.tolist()
y_data = RFM_score['score_pt'].values.tolist()
c = (
    Pie()
    .add(
        "",
        [list(z) for z in zip(x_data, y_data)],
        radius=["30%", "75%"],
        center=["50%", "50%"],
        rosetype="radius",
        is_clockwise=True,
        label_opts=opts.LabelOpts(is_show=True),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}占比{d}%"))
    .set_global_opts(title_opts=opts.TitleOpts(title="类别占比"))
)
c.render_notebook()

 

In terms of groups, the 222 and 111 combinations account for a relatively high share, but there are also many customers in the high-value 333 group. A total score of 6 accounts for 24%, the largest slice. For each indicator, 1 means low, 2 average, and 3 high; 111, for example, represents a low-value customer group. Of course, the cut-offs for specific groupings should be defined for the specific business scenario. The sketch below gives one common reference for segmenting customers by group.
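A minimal segmentation sketch, assuming the widely used convention of comparing each score against its mean and mapping the eight high/low combinations to named segments (the labels and rule below are illustrative, not part of the competition data):

# Illustrative only: binarize each score against its mean, then map the eight
# high/low combinations to a common RFM segment naming scheme
seg = RFM[['r_score', 'f_score', 'm_score']].astype(int)

def hi_lo(s):
    return (s > s.mean()).astype(int)  # 1 = above average, 0 = at or below

keys = list(zip(hi_lo(seg['r_score']), hi_lo(seg['f_score']), hi_lo(seg['m_score'])))
segment_names = {
    (1, 1, 1): 'important value',   (0, 1, 1): 'important keep',
    (1, 0, 1): 'important develop', (0, 0, 1): 'important retain',
    (1, 1, 0): 'general value',     (0, 1, 0): 'general keep',
    (1, 0, 0): 'general develop',   (0, 0, 0): 'general retain',
}
RFM['segment'] = [segment_names[k] for k in keys]
print(RFM['segment'].value_counts())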

5. RFM 3D column chart display

data=RFM_new.values.tolist()
group = list(set(RFM_new.iloc[:, 1]))  # distinct rfm_group values for the x axis
score = list(set(RFM_new.iloc[:, 2]))  # distinct rfm_score values for the y axis
data2 = [[d[1], d[2], d[3]] for d in data]  # (group, score, number) triples for the 3D bars
(
    Bar3D(init_opts=opts.InitOpts(width="1000px", height="600px"))
    .add(
        series_name="",
        data=data2,
        xaxis3d_opts=opts.Axis3DOpts(type_="category", data=group,name='group'),
        yaxis3d_opts=opts.Axis3DOpts(type_="category", data=score,name='score'),
        zaxis3d_opts=opts.Axis3DOpts(type_="value",name='number'),
    )
    .set_global_opts(
        visualmap_opts=opts.VisualMapOpts(
            max_=1500,
            range_color=[
                "#313695",
                "#4575b4",
                "#74add1",
                "#abd9e9",
                "#e0f3f8",
                "#ffffbf",
                "#fee090",
                "#fdae61",
                "#f46d43",
                "#d73027",
                "#a50026",
            ],
        ),
        title_opts=opts.TitleOpts(title="RFM客户分群3D可视化") # overall chart title
    )
    .render_notebook()
)         

The advantage of this chart is that it can be rotated 360 degrees to inspect the data from any angle; the interactivity of pyecharts shows clearly here.

4. Product type correlation analysis

4.1. Extract association rules

Support: the proportion of all transactions that contain both A and B. Writing P(A) for the proportion of transactions containing A, Support = P(A&B), i.e. the number of transactions in which A and B occur together divided by the total number of transactions.

Confidence: among the transactions that contain A, the proportion that also contain B, i.e. the ratio of transactions containing both A and B to transactions containing A. Formula: Confidence = P(A&B)/P(A).

Lift: the ratio of "the proportion of transactions containing A that also contain B" to "the proportion of all transactions containing B". Formula: Lift = (P(A&B)/P(A))/P(B).

Lift reflects the correlation between A and B in an association rule: lift > 1 indicates positive correlation (the higher, the stronger), lift < 1 indicates negative correlation (the lower, the stronger), and lift = 1 means A and B are independent.
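A tiny worked example of the three measures on made-up transactions (the item sets are purely illustrative):

# Five toy transactions; each set holds the item categories one user interacted with
transactions = [{'A', 'B'}, {'A', 'B', 'C'}, {'A'}, {'B', 'C'}, {'A', 'B'}]
n = len(transactions)
p_a  = sum('A' in t for t in transactions) / n         # P(A)   = 4/5
p_b  = sum('B' in t for t in transactions) / n         # P(B)   = 4/5
p_ab = sum({'A', 'B'} <= t for t in transactions) / n  # P(A&B) = 3/5
support    = p_ab              # 0.6
confidence = p_ab / p_a        # 0.75
lift       = confidence / p_b  # 0.9375, slightly below 1: a weak negative association
print(support, confidence, lift)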

For detailed calculations of these three concepts, see: Support, confidence and lift in association analysis (sanqima's blog, CSDN).

import apriori # import the Apriori association-rule helper module
order_ids = pd.unique(df1['user_id'])
order_records = [df1[df1['user_id']==each_id]['item_category'].tolist() for each_id in order_ids]
minS = 0.01  # minimum support threshold
minC = 0.1  # minimum confidence threshold
L, suppData = apriori.apriori(order_records, minSupport=minS)  # frequent itemsets satisfying the minimum support
rules = apriori.generateRules(order_records, L, suppData, minConf=minC)
model_summary = 'data record: {1} \nassociation rules count: {0}'  # show the record count and the number of rules meeting the thresholds
print(model_summary.format(len(rules), len(order_records)),'\n','-'*60)  # formatted output via str.format
rules_all = pd.DataFrame(rules, columns=['item1', 'item2', 'instance', 'support', 'confidence',
                                  'lift'])  # DataFrame of the extracted rules
rules_sort = rules_all.sort_values(['lift'],ascending=False)
print(rules_sort[:20])

177 rules meet the thresholds. Sorted by lift in descending order, the top 20 are shown; among them, the rule between product categories 10661 and 9516 has a confidence of 0.5082 and a lift of 10.2168, which makes it an effective strong association rule.

4.2. Product association rules relationship diagram

Here only the top ten product categories are extracted for visualization, using the networkx library.

rules_sort_filt=rules_sort[rules_sort['lift']>2] # keep rules with lift > 2
# total occurrence count of each item
display_data=rules_sort_filt.iloc[:,:3]
item1=display_data[['item1','instance']].rename(index=str,columns={"item1":"item"})
item2=display_data[['item2','instance']].rename(index=str,columns={"item2":"item"})
item_concat=pd.concat((item1,item2),axis=0)
item_count=item_concat.groupby(['item'])['instance'].sum()
# take the TOP N items that appear in the most rules
control_num = 10
top_n_rules = item_count.sort_values(ascending=False).iloc[:control_num]
top_n_items = top_n_rules.index  # the corresponding category items
top_rule_list = [all((item1 in top_n_items, item2 in top_n_items)) for item1,item2 in zip(display_data['item1'],display_data['item2'])]  # boolean filter via all(): keep rules whose two items are both in the top N
top_display_data = display_data[top_rule_list] 
# the top ten category IDs
top10=top_n_rules.index
top101=[list(x) for x in top10]   # 2-D list
n = np.array(top101).flatten()  # flatten to 1-D
# item1 and item2 are stored as sets, so convert them to plain lists here
lst=[]
lst1=[]
for y,z in zip(top_display_data['item1'],top_display_data['item2']):   
    lst.append(list(y))
    lst1.append(list(z))
n1 = np.array(lst).flatten()
n2 = np.array(lst1).flatten()
n1=pd.DataFrame(n1, columns=['item3'])
n2=pd.DataFrame(n2, columns=['item4'])
n3=pd.DataFrame(top_display_data['instance'].values.tolist(), columns=['instance'])
n4=pd.concat((n1,n2,n3),axis=1)
n4

This is the DataFrame constructed for drawing the relationship graph.

import networkx as nx
plt.figure(figsize=(14,10))
res = n4.values.tolist()
for i in range(len(res)):
    res[i][2] = dict({'weight': res[i][2]})
res = [tuple(x) for x in res]
g = nx.Graph()
g.add_nodes_from(n)
g.add_edges_from(res)
pos = nx.spring_layout(g)
nx.draw(g,pos,node_color='#7FFF00', node_size=1500, alpha=0.6,with_labels=True)
labels = nx.get_edge_attributes(g,'weight') 
nx.draw_networkx_edge_labels(g,pos,edge_labels=labels)
plt.show()

The layout of this relationship diagram is random, so it will look different each time it is drawn, but the underlying data does not change. Each number on an edge is the number of times the association between the two product categories appears.
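If a reproducible layout is preferred, spring_layout accepts a seed (a one-line tweak to the code above; the rest stays unchanged):

pos = nx.spring_layout(g, seed=42)  # fix the random state so the layout is identical across runs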

4.3. Product word cloud chart

Since there are more than 90,000 distinct product IDs after grouping and aggregation, only the top 30 by purchase count are shown in the word cloud.

from pyecharts.charts import WordCloud
from pyecharts.globals import SymbolType
buy=df1.groupby(['item_id'])['user_id'].count().reset_index().sort_values(by=['user_id'],ascending=False)
buy_freq=buy.head(30).values.tolist()
# draw the word cloud
c = (
    WordCloud()
    .add("", buy_freq, word_size_range=[20, 100], shape=SymbolType.DIAMOND)
    .set_global_opts(title_opts=opts.TitleOpts(title="商品词云图"))
)
c.render_notebook()

The most purchased product ID stands out clearly: 167074648.


Summary

This analysis makes heavy use of pyecharts, which can draw interactive charts that matplotlib cannot. The charts shown here are only the tip of the pyecharts iceberg; interested readers can learn more from the official documentation. Of course, for any concrete visualization task you should still choose the tool that fits the scenario.

