Python Data Analysis in Practice: A Supermarket Retail Store

Hello everyone! I am a red panda ❤

This is actually from a long-term outsourcing project; let me share my approach with you.



1. Project Background

This dataset contains detailed consumer product sales records obtained by scanning the barcodes of individual products at the retail store's electronic point of sale. It provides detailed information about the quantity, characteristics, value, and price of the items sold.

2. Data sources

< link >

3. Questions to Explore

  • Consumption analysis and user purchase pattern analysis
  • RFM and CLV Analysis
  • Mining of Association Rules of Different Categories of Commodities

4. Understand the data

  • Date: Purchase date
  • Customer_ID: User ID
  • Transaction_ID: Transaction ID
  • SKU_Category: Product category SKU code
  • SKU: The unique SKU code of the product
  • Quantity: Purchase quantity
  • Sales_Amount: Purchase amount


5. Data cleaning

1. Import data

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
# Change the plot style
plt.style.use('ggplot')
plt.rcParams['font.sans-serif'] = ['SimHei']

np.__version__

pd.__version__

df = pd.read_csv('scanner_data.csv')
df.head()

df.info()

2. Select a subset

The first column is just a row number; the DataFrame already has an index, so drop it.


df.drop(columns='Unnamed: 0', inplace=True)
df.info()


3. Remove duplicate values

df.duplicated().sum()

The data contains no duplicate values.

4. Handle missing values


df.isnull().sum()

The data contains no missing values.

5. Standardize data types


df.dtypes


Date is stored as an object (string) type and needs to be converted to a datetime format.


df.Date = pd.to_datetime(df.Date, format='%d/%m/%Y')
df.dtypes
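As a minimal sketch of the day-first parsing used here (synthetic date strings, not the actual dataset):

```python
import pandas as pd

# Synthetic dates in the same '%d/%m/%Y' layout (not the real dataset)
sample = pd.DataFrame({'Date': ['02/01/2016', '15/03/2016', '31/12/2016']})

# Without an explicit format, '02/01/2016' could be parsed as February 1 instead of
# January 2; passing format='%d/%m/%Y' removes the ambiguity
sample['Date'] = pd.to_datetime(sample['Date'], format='%d/%m/%Y')

print(sample['Date'].dt.month.tolist())  # [1, 3, 12]
```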

6. Outlier Handling

df[['Quantity','Sales_Amount']].describe()

Purchase quantities below 1 occur because some items are sold by weight, so these values are not outliers.
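A quick way to check this claim is to measure how common fractional quantities are; a sketch on invented Quantity values:

```python
import pandas as pd

# Synthetic Quantity values: weighed goods can legitimately be fractional
qty = pd.Series([0.25, 1.0, 2.0, 0.75, 3.0], name='Quantity')

# Count quantities below 1 rather than treating them as outliers
weighed_share = (qty < 1).mean()
print(weighed_share)  # 0.4
```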



6. Analysis content

1. Monthly analysis of consumption

(1) Trend analysis of total monthly consumption amount


df['Month'] = df.Date.dt.to_period('M').dt.to_timestamp()  # truncate to month; astype('datetime64[M]') was removed in pandas 2.x
df.head()


grouped_month = df.groupby('Month')


grouped_month.Sales_Amount.sum()


The January 2018 data may be incomplete, so it is excluded from the trend analysis.


grouped_month.Sales_Amount.sum().head(12).plot()

  • As can be seen from the figure above: the monthly consumption amount fluctuates considerably; it rose steadily through the first quarter, fluctuated sharply afterwards, and shows an overall upward trend
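For reference, the monthly bucketing can also be done with `dt.to_period`, which behaves the same on pandas 1.x and 2.x (sketched on synthetic transactions):

```python
import pandas as pd

# Synthetic transactions (dates and amounts invented for illustration)
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2016-01-05', '2016-01-20', '2016-02-03']),
    'Sales_Amount': [10.0, 5.0, 7.5],
})

# Truncate each date to the first day of its month; this works on both
# pandas 1.x and 2.x, unlike the astype('datetime64[M]') cast
toy['Month'] = toy['Date'].dt.to_period('M').dt.to_timestamp()

monthly_total = toy.groupby('Month')['Sales_Amount'].sum()
print(monthly_total.tolist())  # [15.0, 7.5]
```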

(2) Trend analysis of monthly transaction times

grouped_month.Transaction_ID.nunique().head(12).plot()

As can be seen from the above figure: the number of transactions fluctuated greatly, and the previous period showed an upward trend. After May, the number of transactions began to decline, fell to the lowest value in August, and then began to fluctuate and rebound, returning to the peak in December.

(3) Trend analysis of monthly merchandise purchase quantity

grouped_month.Quantity.sum().head(12).plot()

  • As can be seen from the figure above: the quantity of goods purchased fluctuates considerably, and its overall trend is consistent with the number of transactions

(4) Trend analysis of the monthly number of consumers

grouped_month.Customer_ID.nunique().head(12).plot()

As can be seen from the above figure: the number of monthly purchasers can be simply divided into three stages. From January to May, it shows a continuous upward trend, from June to August, it shows a continuous downward trend, and from September to December, it shows a fluctuating upward trend.

2. User distribution analysis

(1) Distribution of new users

grouped_customer = df.groupby('Customer_ID')
grouped_customer.Date.min().value_counts().plot()

  • As can be seen from the figure above: new-user acquisition is unstable, fluctuates considerably, and shows a slight overall downward trend

grouped_customer.Month.min().value_counts().plot()

  • As can be seen from the figure above: the monthly number of new users shows a clear downward trend, so marketing activities should be stepped up appropriately to improve new-user acquisition

(2) Analysis of the proportion of users with one-time consumption and multiple consumption

# Proportion of users who made only one purchase

(grouped_customer.Transaction_ID.nunique() == 1).sum()/df.Customer_ID.nunique()

  • The calculation shows that about half of the users made only one purchase
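The one-time-buyer share calculation can be sketched on a toy transaction log (IDs invented):

```python
import pandas as pd

# Toy transaction log: customer 1 buys twice, customers 2 and 3 once each
toy = pd.DataFrame({
    'Customer_ID': [1, 1, 2, 3],
    'Transaction_ID': [101, 102, 103, 104],
})

# Users whose distinct transaction count equals 1, divided by all users
one_time = (toy.groupby('Customer_ID')['Transaction_ID'].nunique() == 1).sum()
share = one_time / toy['Customer_ID'].nunique()
print(share)  # 2 of the 3 customers purchased only once
```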
grouped_month_customer = df.groupby(['Month', 'Customer_ID'])


# Each user's first purchase date within each month
data_month_min_date = grouped_month_customer.Date.min().reset_index()
# Each user's overall first purchase date
data_min_date = grouped_customer.Date.min().reset_index()


# Join the two tables on Customer_ID
merged_date = pd.merge(data_month_min_date, data_min_date, on='Customer_ID')
merged_date.head()


# If Date_x equals Date_y, the user is a new user in that month
((merged_date.query('Date_x == Date_y')).groupby('Month').Customer_ID.count() / merged_date.groupby('Month').Customer_ID.count()).plot()

It can be seen from the figure above that the proportion of new users each month is trending downward overall. Combined with the rising number of monthly consumers in the fourth quarter, this indicates that repeat purchases increased during that period.
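The monthly new-user proportion rests on comparing each user's first purchase in a month with their overall first purchase; a minimal sketch on an invented three-row log:

```python
import pandas as pd

# Toy log: customer 1 first buys in Jan and returns in Feb; customer 2 first buys in Feb
toy = pd.DataFrame({
    'Customer_ID': [1, 1, 2],
    'Month': pd.to_datetime(['2016-01-01', '2016-02-01', '2016-02-01']),
    'Date': pd.to_datetime(['2016-01-05', '2016-02-10', '2016-02-15']),
})

# First purchase date per customer per month, and overall first purchase date
month_min = toy.groupby(['Month', 'Customer_ID'])['Date'].min().reset_index()
overall_min = toy.groupby('Customer_ID')['Date'].min().reset_index()

merged = pd.merge(month_min, overall_min, on='Customer_ID')
# A row is a "new user" row when the month's first purchase equals the overall first
new_share = (merged.query('Date_x == Date_y').groupby('Month')['Customer_ID'].count()
             / merged.groupby('Month')['Customer_ID'].count())
print(new_share.tolist())  # Jan: all buyers new; Feb: 1 of 2 buyers new
```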

3. User stratification analysis

(1) RFM hierarchical analysis

pivot_rfm = df.pivot_table(index='Customer_ID',
                           values=['Date', 'Transaction_ID', 'Sales_Amount'],
                           aggfunc={'Date': 'max', 'Transaction_ID': 'nunique', 'Sales_Amount': 'sum'})


# R: days since the customer's last purchase
pivot_rfm['R'] = (pivot_rfm.Date.max() - pivot_rfm.Date)/np.timedelta64(1, 'D')
pivot_rfm.rename(columns={'Transaction_ID': 'F', 'Sales_Amount': 'M'}, inplace=True)


def label_func(data):
    label = data.apply(lambda x: '1' if x > 0 else '0')
    # For R (days since last purchase) smaller is better, so invert its score
    label['R'] = '1' if data['R'] < 0 else '0'
    label = label.R + label.F + label.M
    labels = {
        '111': 'Important value customers',
        '011': 'Important keep customers',
        '101': 'Important development customers',
        '001': 'Important retention customers',
        '110': 'General value customers',
        '010': 'General keep customers',
        '100': 'General development customers',
        '000': 'General retention customers'
    }
    return labels[label]

pivot_rfm['label'] = pivot_rfm[['R','F','M']].apply(lambda x: x - x.mean()).apply(label_func, axis=1)


pivot_rfm.label.value_counts().plot.barh()

pivot_rfm.groupby('label').M.sum().plot.pie(figsize=(6,6), autopct='%3.2f%%')

pivot_rfm.groupby('label').agg(['sum', 'count'])

It can be seen from the above table and figure that:

  • Important keep customers are the main source of sales, while general development customers account for the largest share of the customer count
  • Important keep customers: the main source of sales; they spend heavily and frequently but have not purchased recently, so appropriate marketing activities can bring them back and sustain their purchase frequency
  • Important value customers: the second-largest source of sales; recent, high-spending, high-frequency customers who should be kept as they are
  • Important development customers: high spending with recent purchases but low frequency; suitable strategies can encourage them to buy more often
  • Important retention customers: high spending but low frequency and no recent purchases; they are on the verge of churning and can be re-engaged through appropriate activities to prevent loss
  • General value customers: low spending but high frequency and recent purchases; coupons and similar promotions can stimulate this group to spend more
  • General development customers: the largest group by headcount, with recent purchases but low spending and low frequency; given their share of the customer base, appropriate activities can raise both frequency and spending
  • General keep and general retention customers: handle as appropriate within cost and resource constraints
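The mean-split scoring behind these segments can be sketched on a toy RFM table (values invented; for R, smaller, i.e. more recent, is conventionally the good direction, so its score is inverted):

```python
import pandas as pd

# Toy RFM table (synthetic): R = days since last purchase (lower is better),
# F = number of transactions, M = total spend
rfm = pd.DataFrame(
    {'R': [5, 200, 30], 'F': [10, 1, 4], 'M': [500.0, 20.0, 100.0]},
    index=['u1', 'u2', 'u3'])

# Score each dimension against its mean; R is inverted because recent (small) is good
score = pd.DataFrame({
    'R': (rfm['R'] < rfm['R'].mean()).astype(int),
    'F': (rfm['F'] > rfm['F'].mean()).astype(int),
    'M': (rfm['M'] > rfm['M'].mean()).astype(int),
})

# '111' = recent, frequent, high-spending: the "important value" segment
labels = score.astype(str).agg(''.join, axis=1)
print(labels.tolist())  # ['111', '000', '100']
```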

(2) Hierarchical analysis of user status

pivoted_status = df.pivot_table(index='Customer_ID', columns='Month', values='Date', aggfunc='count').fillna(0)

def active_status(data):
    status = []
    for i in range(len(data)):
        # No purchase this month
        if data.iloc[i] == 0:
            if len(status) > 0:
                if status[i-1] == 'unreg':
                    status.append('unreg')
                else:
                    status.append('unactive')
            else:
                status.append('unreg')
        # Purchase this month
        else:
            if len(status) > 0:
                if status[i-1] == 'unreg':
                    status.append('new')
                elif status[i-1] == 'unactive':
                    status.append('return')
                else:
                    status.append('active')
            else:
                status.append('new')
    return pd.Series(status, index=data.index)

# Use a different name for the result to avoid shadowing the function
status_df = pivoted_status.apply(active_status, axis=1)

status_df.replace('unreg', np.nan).apply(lambda x: x.value_counts()).fillna(0).T.apply(lambda x: x/x.sum(), axis=1).plot.area()

As can be seen from the above figure:

  • New users: the proportion of new users shows a clear downward trend, indicating that efforts to attract new users are insufficient
  • Active users: the proportion peaked in February and then declined slowly, indicating that consumption operations are weakening
  • Inactive users: a clear upward trend, showing fairly obvious customer churn
  • Returning users: a slow upward trend, indicating that recall operations are working well
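The month-over-month state machine above can be sketched on a single customer's synthetic purchase counts:

```python
# Monthly purchase counts for one customer across four months (synthetic)
counts = [0, 2, 0, 1]

# Simplified state machine mirroring the logic above:
# unreg -> new on first purchase; active if also bought the previous month;
# unactive after a gap; return when buying again after a gap
status = []
for c in counts:
    prev = status[-1] if status else 'unreg'
    if c == 0:
        status.append('unreg' if prev == 'unreg' else 'unactive')
    else:
        status.append('new' if prev == 'unreg'
                      else 'return' if prev == 'unactive' else 'active')
print(status)  # ['unreg', 'new', 'unactive', 'return']
```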

4. User life cycle analysis

(1) User life cycle distribution

# The lifetime analysis sample only includes users who purchased at least twice
clv = (grouped_customer[['Sales_Amount']].sum())[grouped_customer.Transaction_ID.nunique() > 1]


clv['lifetime'] = (grouped_customer.Date.max() - grouped_customer.Date.min())/np.timedelta64(1,'D')


clv.describe()

  • From the table above: users who purchased more than once have an average lifetime of 116 days, with an average spend of 121.47 over that lifetime

clv['lifetime'].plot.hist(bins = 50)

As can be seen from the above figure:

  • Many users have a lifetime of 0-90 days, meaning short-lifetime customers make up a high proportion and churn within 90 days is high; these users can be a focus of operations aimed at extending their lifetime;
  • Lifetimes between 90 and 250 days are fairly evenly distributed and cover most users; stimulating these users' consumption can raise the amount spent over the lifetime;
  • Very few users have a lifetime above 250 days, indicating that long-lived loyal customers make up a small share.
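The lifetime calculation in days can be sketched on invented purchase dates for two repeat customers:

```python
import pandas as pd

# Toy purchase dates for two repeat customers (synthetic)
toy = pd.DataFrame({
    'Customer_ID': [1, 1, 2, 2],
    'Date': pd.to_datetime(['2016-01-01', '2016-04-10', '2016-02-01', '2016-02-15']),
})

# Lifetime = days between first and last purchase
g = toy.groupby('Customer_ID')['Date']
lifetime = (g.max() - g.min()) / pd.Timedelta(days=1)
print(lifetime.tolist())  # [100.0, 14.0]
```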

(2) Distribution of user life cycle value

clv['Sales_Amount'].plot.hist(bins = 50)

As can be seen from the above figure:

  • Most users' lifetime value is within 500, and the majority within 100. A few large extreme values pull the mean upward, so the distribution is right-skewed.

(3) User life cycle and its value-related relationships

plt.scatter(x='lifetime', y='Sales_Amount', data=clv)

As can be seen from the above figure:

  • There is no clear linear relationship between user lifetime and the value contributed over it. Within 300 days, some users with longer lifetimes contribute more value than those with shorter ones;
  • Beyond 300 days, some users contribute little value. Given the limited data volume, these results are for reference only.

5. Repurchase Rate and Buy-back Rate Analysis

(1) Repurchase rate analysis (more than one purchase in a month)

# Number of users with more than one transaction in a month
customer_month_again = grouped_month_customer.nunique().query('Transaction_ID > 1').reset_index().groupby('Month').count().Customer_ID

# Number of purchasing users per month
customer_month = grouped_month.Customer_ID.nunique()

# Monthly repurchase rate
(customer_month_again/customer_month).plot()
  • From the figure above: the repurchase rate fluctuates around 25%, meaning roughly a quarter of each month's buyers purchase more than once in that month. The rate declined over the first three months, then recovered, and trends upward overall. Whether to raise the repurchase rate further or focus on acquiring new users should be decided in light of the business model. Because data for the final month is incomplete, conclusions should rest on the complete months.
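The monthly repurchase rate calculation can be sketched on a one-month toy log (IDs invented):

```python
import pandas as pd

# Toy log for one month: customer 1 makes two transactions, customer 2 makes one
toy = pd.DataFrame({
    'Month': pd.to_datetime(['2016-01-01'] * 3),
    'Customer_ID': [1, 1, 2],
    'Transaction_ID': [101, 102, 103],
})

per_user = toy.groupby(['Month', 'Customer_ID'])['Transaction_ID'].nunique()
repeaters = (per_user > 1).groupby('Month').sum()       # users with >1 transaction
buyers = toy.groupby('Month')['Customer_ID'].nunique()  # all buyers that month
print((repeaters / buyers).tolist())  # [0.5]
```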

(2) Buy-back rate analysis (repeat purchase within 90 days)

# 1 = purchased in the prior 90 days and again this month; 0 = purchased in the prior 90 days but not this month; NaN = no purchase in the prior 90 days
def buy_back(data):
    status = [np.nan, np.nan, np.nan]
    for i in range(3, len(data)):
        # Purchased this month (the pivot stores counts, so test > 0 rather than == 1)
        if data.iloc[i] > 0:
            # Purchased within the previous 3 months
            if data.iloc[i-1] > 0 or data.iloc[i-2] > 0 or data.iloc[i-3] > 0:
                status.append(1)
            # No purchase within the previous 3 months
            else:
                status.append(np.nan)
        # No purchase this month
        else:
            # Purchased within the previous 3 months
            if data.iloc[i-1] > 0 or data.iloc[i-2] > 0 or data.iloc[i-3] > 0:
                status.append(0)
            # No purchase within the previous 3 months
            else:
                status.append(np.nan)
    return pd.Series(status, index=data.index)

back_status = pivoted_status.apply(buy_back, axis=1)
back_status.head()

(back_status.sum()/back_status.count()).plot()

As can be seen from the figure above: the 90-day buy-back rate, i.e. the share of recent buyers who purchase again within 90 days, is below 10%, suggesting the store currently operates in user-acquisition mode. However, the earlier analysis showed new-user acquisition trending downward, so the store is not in a healthy state; at this stage it should focus on acquiring new users.
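The 90-day window logic can be sketched on one customer's synthetic monthly purchase counts:

```python
# Monthly purchase counts for one customer (synthetic): buys early, then stops
counts = [1, 2, 0, 0, 0, 0, 1]

# Buy-back flag per month: 1 = bought this month with a purchase in the prior
# 3 months; 0 = no purchase this month despite one in the prior 3 months;
# None = no purchase in the prior 3 months (not at risk)
flags = []
for i in range(3, len(counts)):
    bought_in_window = sum(counts[i - 3:i]) > 0   # any purchase in the prior 3 months
    if counts[i] > 0:
        flags.append(1 if bought_in_window else None)
    else:
        flags.append(0 if bought_in_window else None)
print(flags)  # [0, 0, None, None]
```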

6. Commodity association rule mining

(1) Analysis of hot-selling products

# Top 10 product categories by number of sales records
hot_category = df.groupby('SKU_Category').count().Sales_Amount.sort_values(ascending=False)[:10].reset_index()
plt.barh(hot_category.SKU_Category, hot_category.Sales_Amount)

# Share of each hot-selling category
hot_category['percent'] = hot_category.Sales_Amount.apply(lambda x:x/hot_category.Sales_Amount.sum())
plt.figure(figsize=(6,6))
plt.pie(hot_category.percent,labels=hot_category.SKU_Category,autopct='%1.2f%%')
plt.show()

category_list = df.groupby('Transaction_ID').SKU_Category.apply(list).values.tolist()

from apyori import apriori

min_support_value = 0.02
min_confidence_value = 0.3
# Note: apyori's keyword is min_lift (min_left is a typo)
result = list(apriori(transactions=category_list, min_support=min_support_value, min_confidence=min_confidence_value, min_lift=0))

result

From the above results we can get:

  • 'FU5'–>'LPF': Support is about 2.1%, confidence is about 49.5%. It shows that the probability of purchasing these two types of products at the same time is about 2.1%. After purchasing FU5 type products first, the probability of purchasing LPF type products at the same time is 49.5%.
  • 'IEV'–>'LPF': Support is about 3.1%, confidence is about 48.9%. It shows that the probability of purchasing these two types of products at the same time is about 3.1%, and the probability of purchasing LPF-type products at the same time after purchasing IEV-type products is about 48.9%.
  • 'LPF'–>'IEV': Support is about 3.1%, confidence is about 43.3%. It shows that the probability of purchasing these two types of products at the same time is about 3.1%. After purchasing LPF type products first, the probability of purchasing IEV type products at the same time is about 43.3%.
  • 'OXH'–>'LPF': Support is about 2.0%, confidence is about 48.1%. The probability of purchasing these two categories together is about 2.0%, and the probability of purchasing an LPF-category product after purchasing an OXH-category product is about 48.1%.
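Support and confidence for a rule such as 'FU5' -> 'LPF' can be recomputed by hand; a sketch on invented toy baskets (the category codes are reused for illustration only, so these numbers differ from the real results):

```python
# Toy baskets (synthetic) to show how support and confidence are computed
baskets = [
    {'FU5', 'LPF'},
    {'FU5', 'LPF', 'IEV'},
    {'FU5'},
    {'LPF'},
    {'IEV'},
]

n = len(baskets)
both = sum(1 for b in baskets if {'FU5', 'LPF'} <= b)   # baskets with both categories
fu5 = sum(1 for b in baskets if 'FU5' in b)             # baskets with the antecedent

support = both / n      # P(FU5 and LPF) = 2/5
confidence = both / fu5  # P(LPF | FU5) = 2/3
print(support, confidence)
```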

Finally finished typing it all, my goodness!

That's it for today's article, I hope it helps you~

I'm Red Panda, see you in the next article (✿◡‿◡)



Origin blog.csdn.net/m0_67575344/article/details/127395099