[Machine learning notes] Python data analysis: user consumption behavior (continuous update)


Wine tasting and user consumption behavior analysis are the two case studies I worked through while learning Python data analysis; I am recording the second one here.

There are many write-ups of these cases online, but while studying them I found that the logic of many articles is unclear and their code does not run as published.

So I decided to write up my own debugged version.

Reference blog post:

http://www.360doc.com/content/17/0717/17/16619343_672115832.shtml

https://blog.csdn.net/weixin_44266342/article/details/94187331?depth_1-utm_source=distribute.pc_relevant.none-task&utm_source=distribute.pc_relevant.none-task

https://blog.csdn.net/weixin_44875199/article/details/91452282?depth_1-utm_source=distribute.pc_relevant.none-task&utm_source=distribute.pc_relevant.none-task

 


Preparation

1. Objectives of data analysis

2. Data sets

3. Clean the data before analysis: handle missing values, convert data types, and reshape the data as needed.


Start

This data set contains about 60,000 records of user behavior on the CDNow website from January 1997 to June 1998, with four columns:

  • user_id: user ID
  • order_dt: date of purchase
  • order_products: number of products purchased
  • order_amount: purchase amount
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
columns = ['user_id','order_dt','order_products','order_amount']
# user id, purchase date, number of products purchased, purchase amount
df = pd.read_csv('G:\\Data Scientist Learning\\CDNOW_master.txt', names=columns, sep='\s+')  # load the data
print(df.head())

 

Next:

df.info()  # check the data for null values

Once the data reads in correctly, check for null values and inspect the column dtypes. There are no nulls — very clean data. Since we need monthly data, we must convert the order_dt column into a proper datetime format (year, month, day).

The next step adds a new column, month, to df: take the order_dt column and truncate each date to its month, so that, for example, June 1 through June 30 all map to June.
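A minimal sketch of that conversion on a few made-up rows (the `format='%Y%m%d'` assumption matches the dataset's integer dates such as 19970101):

```python
import pandas as pd

# a few hypothetical rows in the CDNOW_master.txt layout
df = pd.DataFrame({'user_id': [1, 1, 2],
                   'order_dt': [19970101, 19970630, 19970102],
                   'order_products': [1, 3, 1],
                   'order_amount': [11.77, 20.76, 12.00]})

# parse the integer dates, then truncate each one to the first day of its month,
# so June 1 through June 30 all become 1997-06
df['order_dt'] = pd.to_datetime(df['order_dt'], format='%Y%m%d')
df['month'] = df['order_dt'].values.astype('datetime64[M]')
```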
 

 

At this point a quick first pass over the data can be done with df.describe().

Analyzing the describe() output:

  • Mean: the average number of products per order is 2.4;
  • Median: 2;
  • Quartiles: the 75th percentile is 3;
  • Minimum and maximum: 1 and 99, respectively;
  • Average CD spend per order: 35.89 yuan; median: 25.98 yuan

Note: most orders are small, with a handful of large ones — the maximum quantity is 99, so there is some interference from extreme values. Order amounts are fairly stable: the average CD spend per order is about 35 yuan with a median of 25, again with outlier interference. Many retail businesses show this distribution — many small orders, few large ones, with a large share of revenue coming from the large orders. That is the 80/20 rule.


Analyze data trends by month

Here we use groupby, an extremely useful function in data analysis. This section analyzes user behavior month by month, grouping the data with groupby on the month column.

Grouping yields a new object, group_month; summing its order_amount gives total sales per month, which we plot as a line chart.
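A minimal sketch of that grouping, assuming `df` already carries the `month` column built earlier (the sample rows here are made up):

```python
import pandas as pd

# hypothetical mini-dataset standing in for the CDNow data
df = pd.DataFrame({'user_id': [1, 2, 1, 3],
                   'month': pd.to_datetime(['1997-01-01', '1997-01-01',
                                            '1997-02-01', '1997-02-01']),
                   'order_products': [1, 2, 3, 1],
                   'order_amount': [11.77, 25.00, 30.00, 12.00]})

group_month = df.groupby('month')
monthly_sales = group_month.order_amount.sum()   # total sales per month
monthly_buyers = group_month.user_id.nunique()   # distinct consumers per month
# monthly_sales.plot() then draws the monthly sales line chart
```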

 

 

 

 


The three charts above (total sales, total products, and number of orders per month) tell a consistent story, so the data looks sound. However, from April onward sales fall sharply. Why? Next, look at consumers per month:

 

The monthly count of distinct consumers is slightly lower than the monthly count of orders, but not by much. In the first three months there are 8,000 to 10,000 consumers per month; in later months the average drops below 2,000. The trend is the same: many consumers early on, then a steady decline.


Analysis of individual consumer data:

So far we have looked at trends; now we look at individual consumption levels. The main analyses are:

  • Descriptive statistics and a scatter plot of per-user spend and purchase counts, to gauge average consumption levels
  • Distribution of per-user spend (the 80/20 rule)
  • Distribution of per-user purchase counts
  • Cumulative share of total spend by user (what share of users accounts for what share of revenue)
     
group_user_ID = df.groupby('user_id')
print(group_user_ID.sum().describe())

 

 

Grouping on user_id and describing the sums shows, from the user's perspective: the average user bought 7 CDs, the minimum was 1, the maximum 1,033, and the median 3 — so the data is highly skewed. The average spend per user is 106, the median 43, the maximum 13,990, and the 25th percentile 19. Combined with the earlier monthly analysis, these figures sketch the overall shape of CD sales: an initial rise, a sharp drop at some point, and then a long, mostly stable period of low sales.
 

group_userID = df.groupby('user_id')
group_userID.sum().query("order_amount < 3000").plot.scatter(x='order_amount', y='order_products')
# group_userID.sum().order_amount.plot.hist(bins=20)
# group_userID.sum().query("order_products < 100").order_products.plot.hist(bins=40)
# the two commented lines draw the histograms used below
plt.show()

The code above groups by user_id as the index. Note that printing the grouped object alone only shows a DataFrameGroupBy object — you must apply an aggregation (sum, mean, and so on) to it first. Here we sum per user, call the query method to keep rows where order_amount is below 3000, and then draw a scatter plot with plot.scatter.

Scatter chart of user purchase amount and purchase quantity

The scatter plot shows the data concentrated at small quantities and small amounts, in a roughly linear band: the more CDs a user buys, the higher their spend.

group_userID.sum().order_amount.plot.hist(bins=20)

User consumption amount distribution 

The spend distribution shows that most users' totals are very low, basically between 0 and 1000 yuan — the audience is mainly low-spend users.

group_userID.sum().query("order_products<100").order_products.plot.hist(bins = 40)

 User consumption times distribution

The histogram of purchase counts shows that the vast majority of users bought only a handful of times, basically between 0 and 20 purchases.
 

cum1 = group_userID.sum().sort_values("order_amount").apply(lambda x:x.cumsum()/x.sum())
cum1.reset_index().order_amount.plot()
plt.show()

The code above computes each user's cumulative share of total spend. cumsum is a running sum; after computing the shares, reset_index replaces the user_id index with an ascending integer index. The x-axis of the plot is therefore the user's rank and the y-axis is the cumulative share of spend, showing what share of users accounts for what share of revenue.

The curve shows that the bottom 50% of users account for less than 20% of spend, while the top 500 users account for nearly 50% — revenue is concentrated in a few large customers.


User consumption behavior analysis

  • User's first consumption (first purchase) time
  • User's last consumption time
  • User stratification

RFM (the RFM model is a key tool for measuring customer value and profitability)

New / active / returning / churned users

  • User purchase cycle (time between orders)

User consumption cycle description

User consumption cycle distribution

  • User life cycle (from first to last purchase)

User life cycle description

User life cycle distribution

User first purchase time
 

# user's first purchase time
grouped_user = df.groupby('user_id')
grouped_user.min().month.value_counts()
grouped_user.min().order_dt.value_counts().plot()  # first purchases
plt.show()

 

User's last purchase time

grouped_user.max().month.value_counts()
grouped_user.max().order_dt.value_counts().plot()  # last purchases
plt.show()

 

First purchases are concentrated in January to March, and so are last purchases. There are few long-term active customers: most users never buy again after their first purchase, and the count of last purchases in later months keeps climbing as users churn.

Users whose first purchase time equals their last purchase time account for about half of the total, meaning many customers bought once and never returned.


User data layering

Divide users into:

  • '111': important value customers
  • '011': important keep customers
  • '101': important win-back customers
  • '001': important development customers
  • '110': general value customers
  • '010': general keep customers
  • '100': general win-back customers
  • '000': general development customers

The meaning of these digit codes is explained below.

Construct the RFM model (Recency, Frequency, Monetary)

Execute the following two pieces of code separately:

rfm = df.pivot_table(index = 'user_id',
                    values = ['order_products', 'order_amount', 'order_dt'],
                    aggfunc = {'order_dt':'max',
                               'order_amount':'sum',
                               'order_products':'sum'
                              })
rfm.head()

Here we use a new function: pandas' pivot table, pivot_table. It works like Excel's pivot table but is more flexible. The signature is df.pivot_table(index=[], columns=[], values=[], aggfunc=...); the parameters mean:

  • index: which field(s) to group on as the index;
  • columns: which field(s) to spread across the columns;
  • values: which fields to keep and aggregate;
  • aggfunc: the aggregation applied to each field — if omitted, the default is the mean
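A tiny illustration of these parameters on made-up data (the `sales` frame here is hypothetical, not part of the CDNow set):

```python
import pandas as pd

sales = pd.DataFrame({'user': ['a', 'a', 'b'],
                      'month': ['jan', 'feb', 'jan'],
                      'amount': [10, 20, 5]})

# group rows by 'user', keep 'amount', and sum it within each group;
# with aggfunc omitted, pivot_table would default to the mean instead
pt = sales.pivot_table(index='user', values='amount', aggfunc='sum')
```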
# R: days from each user's last purchase to the newest date in the data
rfm['R'] = -(rfm.order_dt - rfm.order_dt.max()) / np.timedelta64(1, 'D')
rfm.rename(columns = {'order_products': 'F', 'order_amount':'M'},
                 inplace=True)
rfm.head()

 

def rfm_func(x):
    # mark each of R, F, M with '1' if the centered value is >= 1, else '0'
    level = x.apply(lambda x: '1' if x >= 1 else '0')
    label = level.R + level.F + level.M
    d = {
        '111': '重要价值客户',  # important value customers
        '011': '重要保持客户',  # important keep customers
        '101': '重要挽留客户',  # important win-back customers
        '001': '重要发展客户',  # important development customers
        '110': '一般价值客户',  # general value customers
        '010': '一般保持客户',  # general keep customers
        '100': '一般挽留客户',  # general win-back customers
        '000': '一般发展客户'   # general development customers
        }
    result = d[label]
    return result

rfm['label'] = rfm[['R', 'F', 'M']].apply(lambda x:x-x.mean()).apply(rfm_func,axis=1)
rfm.head()

rfm.groupby('label').sum()

for label, grouped in rfm.groupby('label'):
    x = grouped['F']
    y = grouped['R']

    plt.scatter(x, y, label=label)  # one scatter layer per RFM label
plt.legend(loc='best')  # legend position
plt.xlabel('Frequency')
plt.ylabel('Recency')
plt.show()

The RFM stratification shows that most users fall into the "important keep customers" bucket, but that is driven by extreme values; in practice the R/F/M cut-offs should be chosen according to the business:

  • Try to cover most of the revenue with a small share of users
  • Don't choose cut-offs just to make the numbers look good

Stratifying users as new / active / returning / churned

# classify users by whether they purchased in each month
pivoted_counts = df.pivot_table(index = 'user_id',
                                columns = 'month',
                                values = 'order_dt',
                                aggfunc = 'count').fillna(0)
pivoted_counts.columns = df.month.sort_values().astype('str').unique()
pivoted_counts.head()

df_purchase = pivoted_counts.applymap(lambda x: 1 if x> 0 else 0)
df_purchase.tail() 

If the user did not purchase this month:

  • if they have never purchased before, they remain unregistered (unreg)
  • if they purchased in an earlier month, they are now inactive (unactive)

If the user did purchase this month:

  • if it is their first ever purchase, they are new
  • if they purchased before but were inactive last month, they are returning (return)
  • otherwise (they also purchased last month), they are active
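The original post applies a function named `active_status` without showing its body; the rules above can be implemented row-wise like this (a reconstruction, with the status names `unreg`/`unactive`/`new`/`return`/`active` assumed to match the `replace('unreg', ...)` call that follows):

```python
import pandas as pd

def active_status(data):
    """Label each month for one user, given a row of 0/1 purchase flags."""
    status = []
    for i in range(len(data)):
        if data.iloc[i] == 0:                      # no purchase this month
            if len(status) == 0 or status[i - 1] == 'unreg':
                status.append('unreg')             # never purchased yet
            else:
                status.append('unactive')          # purchased before, not this month
        else:                                      # purchased this month
            if len(status) == 0 or status[i - 1] == 'unreg':
                status.append('new')               # first ever purchase
            elif status[i - 1] == 'unactive':
                status.append('return')            # came back after a gap
            else:
                status.append('active')            # also purchased last month
    return pd.Series(status, index=data.index)
```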
purchase_states = df_purchase.apply(active_status,axis = 1)
purchase_states.tail()

 

purchase_states_ct = purchase_states.replace('unreg', np.NaN).apply(lambda x: x.value_counts())
purchase_states_ct

The unreg status is excluded (replaced with NaN) because those users have not appeared yet; they only count from the month they first become new users. This yields a monthly count of users in each status.

# transpose for easier reading
purchase_states_ct.fillna(0).T

# area chart of user status counts per month
purchase_states_ct.fillna(0).T.plot.area(figsize = (12, 6))
plt.show()

In the area chart the blue and grey bands dominate, but they can be ignored: they merely reflect the follow-up status of users who purchased once early on and then stopped. The red band of active users is small but stable — these are core users. Together with the purple returning users, the two layers make up the consuming user base in the later months (when there are no new customers).

Proportion of returning users

plt.figure(figsize=(20, 4))
rate = purchase_states_ct.fillna(0).T.apply(lambda x: x/x.sum())
plt.plot(rate['return'],label='return')
plt.plot(rate['active'],label='active')
plt.legend()
plt.show()

  • Returning-user ratio: the share of returning users among all users in a given period

       The chart shows that the monthly share of returning users sits between 5% and 8% and trends downward, indicating that customers are churning.

  • Returning-user rate: of last month's inactive users, the share who purchased this month

       Since the number of inactive users in this data is roughly constant, the returning rate here is approximately equal to the returning ratio.

  • Active-user ratio: the share of active users among all users in a given period.

       The share of active users is between 3% and 5%, with a more pronounced downward trend. Active users can be viewed as continuously consuming users, whose loyalty is higher than that of returning users.

Combining the two groups: among users who still purchase in the later months, about 60% are returning users and 40% are active users, so the overall quality of the remaining user base is fairly good. This echoes the 80/20 rule from the earlier consumption analysis: in consumer businesses, it pays to focus on high-quality users.

User purchase cycle

# time gap between consecutive orders for each user
order_diff = grouped_user.apply(lambda x:x.order_dt - x.order_dt.shift())
order_diff.head(10)

order_diff.describe()

# distribution of order intervals
(order_diff / np.timedelta64(1, 'D')).hist(bins = 20)
plt.show()

  • Order intervals follow roughly an exponential distribution
  • The average purchase interval is 68 days
  • Most users' purchase intervals are under 100 days
  • The interval histogram is a typical long-tailed plot: most users' gaps between purchases are indeed short. A reasonable recall schedule might be: send a coupon immediately after a purchase, ask the user about the CD 10 days later, remind them the coupon is expiring after 30 days, and push an SMS after 60 days.

User life cycle

# last purchase time minus first purchase time
user_life = grouped_user.order_dt.agg(['min', 'max'])
user_life.head()

# share of users who purchased only once
(user_life['min'] == user_life['max']).value_counts().plot.pie()
plt.show()

(user_life['max'] - user_life['min']).describe()

The description shows the average user life cycle is 134 days — higher than expected, but the mean is misleading: the median is 0 days, i.e. most users' first purchase is also their last. Those are low-value users. The maximum is 544 days, almost the full span of the data set; that user is a core user.
Because every user in the data made their first purchase in the first three months, the life cycle here is really the life cycle of the January-to-March cohort. If users keep consuming beyond the observation window, the average life cycle will grow.
 

plt.figure(figsize=(20, 4))
plt.subplot(121)
((user_life['max'] - user_life['min']) / np.timedelta64(1, 'D')).hist(bins=15)
plt.title('Life cycle histogram (all users)')
plt.xlabel('days')
plt.ylabel('users')

# filter out users whose life cycle is 0
plt.subplot(122)
u_l = ((user_life['max'] - user_life['min']).reset_index()[0] / np.timedelta64(1, 'D'))
u_l[u_l > 0].hist(bins=40)
plt.title('Life cycle histogram (life cycle > 0)')
plt.xlabel('days')
plt.ylabel('users')
plt.show()

Comparing the two plots: after filtering out zero-length life cycles, the distribution is bimodal. Many users' life cycles still tend toward 0 days, but the picture is more informative than the first plot. Some low-value users buy twice and then stop, so to lift conversion, users should be guided within roughly 30 days of their first purchase. A smaller group of users clusters at 50-300 days. Users beyond 400 days are high-quality users, and their count even grows toward the right tail — these are loyal core users whose interests should be protected.

# average life cycle of users with more than one purchase
u_l[u_l > 0].mean()

The average life cycle of users with at least two purchases is 276 days, far above the overall average; guiding users toward a second purchase after their first is therefore an effective way to extend user life cycles.


Analysis of the repurchase rate and the buy-back rate

  • Repurchase rate
    • share of consuming users who purchased more than once within a calendar month
  • Buy-back rate
    • share of users who purchased in one period and purchased again in the next
# 1 = two or more purchases in the month, 0 = exactly one, NaN = no purchase
purchase_r = pivoted_counts.applymap(lambda x: 1 if x > 1 else np.NaN if x==0 else 0)
purchase_r.head()

 

# repurchase rate line chart
(purchase_r.sum() / purchase_r.count()).plot(figsize = (10, 4)) 
plt.show()

The repurchase rate stabilizes at around 20%. In the first three months it is lower because of the flood of new users, most of whom bought only once.

def purchase_back(data):
    # 1 = bought this month and again next month; 0 = bought this month but not next;
    # NaN = no purchase this month
    status = []
    for i in range(len(data) - 1):
        if data.iloc[i] == 1:
            if data.iloc[i + 1] == 1:
                status.append(1)
            if data.iloc[i + 1] == 0:
                status.append(0)
        else:
            status.append(np.NaN)
    status.append(np.NaN)  # the last month has no "next month" to compare against
    return pd.Series(status, df_purchase.columns)
purchase_b = df_purchase.apply(purchase_back,axis = 1)
purchase_b.head()

In purchase_b, 1 marks a buy-back user (purchased this month and again the next), 0 marks a user who purchased this month but not the next, and NaN marks a month with no purchase.

The buy-back rate chart shows that the buy-back rate is higher than the repurchase rate, around 30%, with strong volatility; for new users it is around 15%, noticeably lower than for old users.
The count plot shows that the number of buy-back users stabilizes after the first three months, so the fluctuation in the rate is probably seasonal marketing. The users who buy back largely overlap with those who repurchase — these are the high-quality users.
Combining the two metrics: new customers' overall loyalty is lower than old customers'; old customers buy back reliably, even if their monthly purchase frequency is slightly low. That is the consumption profile of CDNow's users.


Retention rate analysis

 

# gap between each purchase and the user's first purchase
user_purchase = df[['user_id','order_products','order_amount','order_dt']]
user_purchase_retention = pd.merge(left = user_purchase,
                                   right = user_life['min'].reset_index(),
                                   how = 'inner',
                                   on = 'user_id')
user_purchase_retention['order_dt_diff'] = user_purchase_retention['order_dt'] - user_purchase_retention['min']
user_purchase_retention['dt_diff'] = user_purchase_retention.order_dt_diff.apply(lambda x: x / np.timedelta64(1, 'D'))

user_purchase_retention.head()
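The post stops here; a common continuation (my sketch, not from the original) buckets `dt_diff` into retention windows with `pd.cut` and pivots per user. The column names follow the frame built above, and the sample rows are made up:

```python
import pandas as pd

# hypothetical rows shaped like user_purchase_retention above
user_purchase_retention = pd.DataFrame({'user_id': [1, 1, 1, 2],
                                        'order_amount': [10.0, 20.0, 5.0, 8.0],
                                        'dt_diff': [0.0, 10.0, 40.0, 0.0]})

# bucket each order's day-offset from the first purchase into retention windows;
# offsets of 0 (the first purchase itself) fall outside the (0, 3] bin and drop out
bins = [0, 3, 7, 15, 30, 60, 90, 180, 365]
user_purchase_retention['dt_diff_bin'] = pd.cut(user_purchase_retention.dt_diff, bins=bins)

# amount each user spent in each window after their first purchase
pivoted_retention = user_purchase_retention.pivot_table(index='user_id',
                                                        columns='dt_diff_bin',
                                                        values='order_amount',
                                                        aggfunc='sum')
```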

 

 

Follow-up supplement.

 

 

 

 




 

 

 


Origin blog.csdn.net/seagal890/article/details/105229075