1. Analysis background
This dataset contains sales records from an e-commerce platform, covering April 22, 2010 to July 24, 2014. Analyzing these records can reveal the value of each customer.
Here we use KMeans clustering to implement the LRFM model and segment customers by value, which makes it easier to group customers, target promotions, and increase sales.
LRFM model definition:
L: The time interval between the member creation date and July 25, 2014 (unit: month)
R: The time interval between the last purchase by the member and July 25, 2014 (unit: month)
F: Number of member purchases
M: Total purchase amount of the member
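The four definitions above can be illustrated on a tiny made-up table. This is only a sketch: the column names (`member_id`, `created`, `sale_date`, `order_no`, `amount`) and the three rows are hypothetical and are not taken from the real dataset. A month is approximated as 30 days.

```python
import pandas as pd

# Hypothetical toy orders; column names and values are illustrative only.
orders = pd.DataFrame({
    "member_id": [1, 1, 2],
    "created": pd.to_datetime(["2014-01-26", "2014-01-26", "2014-05-26"]),
    "sale_date": pd.to_datetime(["2014-03-27", "2014-06-25", "2014-06-25"]),
    "order_no": ["a1", "a2", "b1"],
    "amount": [100.0, 50.0, 30.0],
})
reference = pd.Timestamp("2014-07-25")  # reference date used by the model

# One row per member: creation date, last sale, order count (F), total spend (M)
lrfm = orders.groupby("member_id").agg(
    created=("created", "min"),
    last_sale=("sale_date", "max"),
    F=("order_no", "nunique"),
    M=("amount", "sum"),
)
# L and R in months, approximating one month as 30 days
lrfm["L"] = (reference - lrfm["created"]).dt.days / 30
lrfm["R"] = (reference - lrfm["last_sale"]).dt.days / 30
lrfm = lrfm[["L", "R", "F", "M"]]
print(lrfm)
```

In this toy table, member 1 joined 180 days before the reference date (L = 6) and member 2 joined 60 days before it (L = 2). Both last purchased 30 days before it (R = 1).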
2. Analysis process
3. Data Exploration
3.1 Import related packages and read data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from datetime import datetime
from sklearn.cluster import KMeans
plt.rcParams['font.sans-serif'] = 'SimHei'
%matplotlib inline
# Read the data
df = pd.read_csv(r'C:/Users/Administrator/Desktop/RFM分析1.csv',
                 engine='python')
# Check the number of rows and columns
df.shape
Output:
3.2 View table structure
As the output shows, only the class2 column has missing values. This column is not needed for the LRFM indicators, so we leave it uncleaned for now.
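One possible way to inspect the table structure and count missing values per column is sketched below. The frame `toy` is a made-up stand-in for `df`: only the column name `class2` comes from the text, and the other columns and all values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; only 'class2' is a column name mentioned in the text.
toy = pd.DataFrame({
    "class1": ["a", "b", "c"],
    "class2": ["x", np.nan, np.nan],
    "amount": [10.0, -5.0, 20.0],
})

# Column dtypes and non-null counts
toy.info()

# Number of missing values per column, keeping only columns that have any
missing = toy.isnull().sum()
print(missing[missing > 0])
```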
3.3 Descriptive analysis view
Some sales amounts are less than or equal to zero. These outliers must be filtered out during data cleaning.
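A minimal sketch of how such outliers can be spotted with descriptive statistics, using a made-up series `sales` (the values are illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical sales amounts, including one negative and one zero value
sales = pd.Series([120.0, 35.5, -20.0, 0.0, 64.0], name="amount")

# count, mean, std, min, quartiles, max; a negative 'min' flags outliers
summary = sales.describe()
print(summary)

# Number of records that a "> 0" filter would remove
n_bad = int((sales <= 0).sum())
print(n_bad)
```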
4. Data cleaning
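4.1 Filter out sales amounts <= 0
# Some sales amounts are less than or equal to 0; filter them out directly
# There are 22542 records here
# .copy() avoids pandas' SettingWithCopyWarning when columns are assigned later
data = df[df['销售金额'] > 0].copy()
data.shape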
Output:
4.2 Conversion of member creation date and sales date to datetime format
data['会员创建日期'] = pd.to_datetime(data['会员创建日期'])
data['销售日期'] = pd.to_datetime(data['销售日期'])
# Check whether the conversion succeeded
data.info()
Output:
5. Construct L, R, F, M indicators
5.1 Extract useful indicators
L = reference date (set here to July 25, 2014) - member creation date
R = reference date (set here to July 25, 2014) - latest (maximum) sale date
F = number of purchases by the user (counted as distinct serial numbers)
M = total amount purchased by the user
Code:
# Aggregate, per member, the fields needed for L, R, F and M
# L and R are converted to months later by dividing the day count by 30
data1 = data.groupby('UseId').agg({'会员创建日期': ['min'],
'销售日期': ['min', 'max'],
'销售金额':['sum'],
'流水号':['nunique']})
data1
Output:
Delete a layer of column names and rename them:
# Drop the first level of the column names
data1.columns = [col[1] for col in data1.columns]
# Rename the columns
data1.columns = ['会员创建日期', '最早销售日期', '最晚销售日期', 'M', 'F']
data1
Output:
The M and F indicators have been constructed.
5.2 Construct L and R indicators
# Compute L and R first, then convert them to months
data1['L'] = datetime.strptime('2014-7-25', '%Y-%m-%d') - data1['会员创建日期']
data1['R'] = datetime.strptime('2014-7-25', '%Y-%m-%d') - data1['最晚销售日期']
# Convert L and R to months (1 month = 30 days), rounded to 3 decimals
data1['L'] = data1['L'].apply(lambda x: round(x.days/30,3))
data1['R'] = data1['R'].apply(lambda x: round(x.days/30,3))
data1
Output result:
Extract useful indicators:
LRFM_data = data1[['L', 'R', 'F', 'M']]
6. Z-score standardization of the L, R, F, M data
ss = preprocessing.StandardScaler()
ss_LRFM_data = ss.fit_transform(LRFM_data)
ss_LRFM_data
Output:
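As a sanity check on what the Z-score conversion does, the sketch below compares `StandardScaler` with the formula z = (x - mean) / std on a small made-up array `X` (the values are illustrative). `StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 3 samples, 2 features on very different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

scaled = StandardScaler().fit_transform(X)

# Same transformation by hand, column by column (np.std defaults to ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so no single indicator (for example M, which is in currency units) dominates the distances that KMeans computes.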
7. Use KMeans for cluster analysis
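# n_clusters: the number of clusters
# (n_jobs is omitted: KMeans no longer accepts it since scikit-learn 1.0)
kmodel = KMeans(n_clusters=5)
kmodel.fit(ss_LRFM_data)
# View the cluster centers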
kmodel.cluster_centers_
Output:
Convert the result into a DataFrame
client_level = pd.DataFrame(kmodel.cluster_centers_,
index=['客户群1', '客户群2', '客户群3', '客户群4', '客户群5'],
columns=['L', 'R', 'F', 'M'])
client_level
Output:
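The cluster centers are expressed in standardized units. A possible follow-up, sketched here on synthetic data rather than the dataset above, is to attach each customer's cluster label with `labels_` and to map the centers back to the original units with `inverse_transform`. The names `toy`, `scaler` and `km`, the two-cluster setup, and the fixed `random_state` are all assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

cols = ["L", "R", "F", "M"]

# Synthetic LRFM table with two well-separated groups (illustrative only)
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    np.vstack([rng.normal(0, 0.1, (20, 4)), rng.normal(5, 0.1, (30, 4))]),
    columns=cols,
)

scaler = StandardScaler()
scaled = scaler.fit_transform(toy.to_numpy())
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Cluster label of every customer, and the size of each cluster
toy["cluster"] = km.labels_
sizes = toy["cluster"].value_counts()

# Centers mapped back from z-scores to the original units
centers = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_), columns=cols)
print(sizes)
print(centers.round(2))
```

Cluster sizes matter for interpretation: a group with very attractive centers may still contain only a handful of customers.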
8. Categorize customer groups based on the results
The larger L is, the longer ago the member registered relative to the reference date (July 25, 2014), i.e. the older the customer. Larger is better.
The smaller R is, the more recent the member's last purchase relative to the reference date (July 25, 2014). Smaller is better.
The larger F is, the more purchases the member has made.
The larger M is, the more the member has spent in total.
Customer group 1 analysis:
L is large, R is small, F is large, and M is large. These are judged to be important development customers.
Customer group 2 analysis:
L is large, R is large, F is small, and M is small. These are judged to be important retention customers.
Customer group 3 analysis:
L is small, R is small, F is small, and M is small. These are judged to be low-value customers.
Customer group 4 analysis:
L is large, R is large, F is small, and M is small. These are judged to be general-value customers.
Customer group 5 analysis:
L is large, R is small, F is large, and M is large. These are judged to be important customers to keep.
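The "large" and "small" judgments above can be read off the standardized cluster centers: after the Z-score conversion each indicator has an overall mean of 0, so a positive center is above average and a negative one is below average. The sketch below applies that rule to made-up centers. The numbers and group names are hypothetical and are not the centers produced by this analysis, and using 0 as the threshold is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical standardized cluster centers; NOT the values from this analysis
centers = pd.DataFrame(
    [[0.8, -0.6, 1.2, 1.1],
     [0.5, 0.9, -0.4, -0.3],
     [-1.0, -0.2, -0.5, -0.6]],
    index=["group A", "group B", "group C"],
    columns=["L", "R", "F", "M"],
)

# Above the overall mean (0 after standardization) -> "large", otherwise "small"
levels = pd.DataFrame(
    np.where(centers > 0, "large", "small"),
    index=centers.index,
    columns=centers.columns,
)
print(levels)
```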