Analysis and Summary of a Clustering-Based Personalized Recommendation E-commerce Case

Before starting the article, I would first like to thank two teachers, Ju'an Jiang and Teacher Luo: the project was completed successfully under their guidance.
The data has been desensitized, so please do not question its authenticity.
Full code link .

1. Purpose

Segment customers into groups based on their order/transaction behavior and on product attribute features, then recommend products they have not yet purchased to each user in a personalized way.

2. Analysis ideas

1. Starting from the category attributes of the products, use a clustering model to find customer groups with the same preferences.
2. Then use product unit price as a quantitative indicator of consumption behavior, find the favorite products within each customer group, and recommend them to users in the same group who have not yet purchased them.

3. Clustering based on user data

1. Read data

df_order = pd.read_csv("orders.csv")                          # read the order data
df_item = pd.read_csv("Items_orders.csv")                     # read the item transaction data
df_atrr = pd.read_csv("Items_attribute.csv",encoding='gbk')   # read the item attribute data

2. Explore the data

2.1. First observe the relationships among the three tables

1. Looking at the order-number column in the order table and the order-detail table, we can see that the order table is the summary of the order-detail table; merging the two tables lets us explore users' order behavior data.

df_order.shape,np.unique(df_order.订单编号).shape  # no duplicate rows
df_item.shape,np.unique(df_item.订单编号).shape
# after deduplication, the 订单编号 count matches the row count of df_order
# combined with the data, we can confirm df_item is the order-detail table and
# df_order the order table; the two tables need to be merged into one
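The merge described above can be sketched as follows. This is a minimal illustration with hypothetical toy rows (the real data comes from orders.csv and Items_orders.csv); 订单编号 is the shared key, so an inner merge broadcasts each order's fields onto its item lines.

```python
import pandas as pd

# Toy stand-in tables; 订单编号 (order number) is the join key
df_order = pd.DataFrame({"订单编号": ["A1", "A2"], "买家会员名": ["u1", "u2"]})
df_item = pd.DataFrame({"订单编号": ["A1", "A1", "A2"], "标题": ["t1", "t2", "t3"]})

# one output row per item line, with the order's fields attached
df_merged = pd.merge(df_order, df_item, on="订单编号", how="inner")
print(df_merged.shape)  # (3, 3)
```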

1. Looking at the 宝贝ID (item ID) and 标题 (title) columns in the product information table: to explore the relationship between users and products, we must merge the product information table with the order-detail table.

df_atrr.shape,np.unique(df_atrr.宝贝ID).shape   # no duplicate rows
# compared against df_item, this is clearly the product information table
# it must be merged with df_item to add the product attributes to df_item
# 宝贝ID would be the natural join key, but df_item has no 宝贝ID column;
# 标题 (title) is the only shared field, so we join on 标题

np.unique(df_atrr.标题).shape,np.unique(df_item.标题).shape
# some titles in df_item have no match in df_atrr
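The unmatched titles can be surfaced explicitly with `pd.merge(..., indicator=True)`. A minimal sketch with hypothetical rows (the real join is between df_item and df_atrr on 标题):

```python
import pandas as pd

# Toy tables: 标题 is the only shared field, and one item title
# has no match in the attribute table
df_item = pd.DataFrame({"标题": ["积木", "拼图", "绝版玩具"]})
df_atrr = pd.DataFrame({"标题": ["积木", "拼图"], "适用年龄": ["2岁,3岁", "5岁,6岁"]})

# left merge keeps every item row; indicator=True marks unmatched ones
joined = pd.merge(df_item, df_atrr, on="标题", how="left", indicator=True)
unmatched = joined[joined["_merge"] == "left_only"]
print(unmatched["标题"].tolist())  # ['绝版玩具']
```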

2.2. Check whether the imported column names need preprocessing

The purpose of this step is to remove whitespace from column names so the columns can be retrieved easily.

for df in [df_order,df_atrr,df_item]:
    df.columns=df.columns.map(lambda x:x.strip())
# strip whitespace from the column names of every table
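The effect of the stripping loop on a single frame, with a hypothetical padded column name:

```python
import pandas as pd

# Toy frame with a padded column name; stripping lets us use
# attribute access such as df.订单编号 later on
df = pd.DataFrame({" 订单编号 ": ["A1"], "价格": [9.9]})
df.columns = df.columns.map(lambda x: x.strip())
print(list(df.columns))  # ['订单编号', '价格']
```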

2.3. Select fields based on business evaluation

How exactly do we explore the data? I will not cover every detail here; the full code link contains the complete process. This step is the most tedious and the most important. Like peeling an ear of corn, we analyze layer by layer and finally obtain users' order-behavior data, transaction-behavior data, and product-category-behavior data, following the steps below:

2.3.1. Based on user order data, mine user order behavior data.

1. Use the mean() method to compute the proportion of missing values per column, then drop features whose missing ratio exceeds 80%.

# drop the remaining columns with more than 80% missing values
isna_columns=df_order_d1.isnull().mean()>0.8
df_order_d1.drop(columns=isna_columns[isna_columns].index.tolist(),axis=1,inplace=True)
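The drop logic can be verified on a small synthetic frame (toy data, not the real orders table): only columns whose missing ratio exceeds 0.8 are removed.

```python
import pandas as pd
import numpy as np

# 'b' is 100% missing (dropped); 'c' is 40% missing (kept)
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [np.nan] * 5,
    "c": [1, np.nan, np.nan, 4, 5],
})
isna_columns = df.isnull().mean() > 0.8
df.drop(columns=isna_columns[isna_columns].index.tolist(), inplace=True)
print(list(df.columns))  # ['a', 'c']
```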

2. We use the describe() method to check for abnormal values, then use seaborn to draw histograms (to observe the distribution curve) and box plots of 实际支付金额 (the amount actually paid by the buyer), 宝贝种类 (number of distinct items), and 宝贝总数量 (total item count) to decide how to handle outliers. In the end we only removed outliers for 实际支付金额 and 宝贝总数量, deleting just 10 records in total. Why not remove the outliers of the 宝贝种类 feature? Judging from the data and the plots, its distribution is fairly uniform, and outliers that carry meaning should not be deleted.
Only the plots for the 宝贝种类 feature are shown here.

f,(ax1,ax2)=plt.subplots(1,2,figsize=(12,6))
sns.distplot(df_order_d2.宝贝种类,ax=ax1)        # distribution curve
sns.boxplot(y='宝贝种类',data=df_order_d2,ax=ax2)  # box plot
plt.show()

[Figure: distribution plot and box plot of the 宝贝种类 feature]
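One common way to remove such outliers is the 1.5×IQR rule; this is a sketch of that approach with toy values, not necessarily the exact rule used in the original notebook:

```python
import pandas as pd

# Keep only rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for one column
def drop_iqr_outliers(df, col, k=1.5):
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[col].between(q1 - k * iqr, q3 + k * iqr)]

# toy payment amounts; the 500 row is an obvious outlier
demo = pd.DataFrame({"实际支付金额": [10, 12, 11, 13, 9, 500]})
print(len(drop_iqr_outliers(demo, "实际支付金额")))  # 5
```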

2.3.2. Mining user transaction behavior data based on user transaction data

The data exploration in this step is relatively simple: the order-detail table has few features, and the only useful one is 价格 (price). After confirming there are no missing values, use pd.merge to combine it with the df_order_d3 table from the previous step and aggregate the users' transaction behavior data.
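The merge-and-aggregate step can be sketched as follows. The toy rows and the choice of summing 价格 per order are assumptions for illustration; column names follow the tables above.

```python
import pandas as pd

# Toy stand-ins for the cleaned order table and the order-detail table
df_order_d3 = pd.DataFrame({"订单编号": ["A1", "A2"], "买家会员名": ["u1", "u2"]})
df_item = pd.DataFrame({"订单编号": ["A1", "A1", "A2"], "价格": [10.0, 5.0, 8.0]})

# total item price per order, then attached to the order table
price_per_order = df_item.groupby("订单编号", as_index=False)["价格"].sum()
df_trade = pd.merge(df_order_d3, price_per_order, on="订单编号", how="left")
print(df_trade["价格"].tolist())  # [15.0, 8.0]
```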

2.3.3. Mining the user's commodity category behavior data based on the attributes of the commodity purchased by the user

After exploring each feature, we found that only the 适用年龄 (applicable age) feature is usable; it can be used to generate a new tag feature that serves as the product category.

# Define a tag scheme based on the product's applicable age (适用年龄)
# under 2 years (ages given in months) -> 婴儿 (infant)
# 2-4 years -> 幼儿 (toddler)
# 5-7 years -> 儿童 (child)
# 8 years and up -> 学生 (student)
def addTag(x):
    tag=''
    if '月' in x:        # ages expressed in months indicate an infant product
        tag+='婴儿|'
    x=x.split(',')
    if '2岁' in x or '3岁' in x or '4岁' in x:
        tag+='幼儿|'
    if '5岁' in x or '6岁' in x or '7岁' in x:
        tag+='儿童|'
    if '8岁' in x or '9岁' in x or '10岁' in x or  '11岁' in x or '12岁' in x or '13岁'in x or '14岁' in x:
        tag+='学生|'
    if 'missing' in x:
        tag+='missing'
    return tag

df_atrr['tag']=df_atrr.适用年龄.apply(addTag)
# categorize each toy product by its applicable age
df_atrr.head()

2.3.4. Data standardization

The purpose of this step is to eliminate the scale (dimensional) differences among features such as the amount actually paid by the buyer, the number of distinct items, the total item count, and the price.

from sklearn.preprocessing import MinMaxScaler  # import the scaling utility
# get the analysis data
data_pre=user_info5.iloc[:,1:].values
mms=MinMaxScaler()
data_norm=mms.fit_transform(data_pre)   # min-max scaling removes the scale effects
pd.DataFrame(data_norm).head()
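Min-max scaling maps each column to [0, 1] via (x − min) / (max − min). A quick check on toy data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# two toy columns on very different scales
data = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
scaled = MinMaxScaler().fit_transform(data)
print(scaled[:, 0].tolist())  # [0.0, 0.5, 1.0]
print(scaled[:, 1].tolist())  # [0.0, 0.5, 1.0]
```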

3. Modeling

1. We first determine the optimal K by plotting the silhouette-coefficient curve. Graphically, K = 2 scores best, but from a business perspective two clusters are too coarse a division; excluding K = 2, the next peak is at K = 8, which also scores well and fits the business need. Therefore we choose K = 8.

from sklearn.cluster import KMeans            # KMeans clustering
from sklearn.metrics import silhouette_score  # silhouette coefficient

# find the optimal k via the silhouette coefficient
score=[]
for k in range(2,16):
    km=KMeans(n_clusters=k)
    res_km=km.fit(data_norm)
    score.append(silhouette_score(data_norm,res_km.labels_))

plt.plot(range(2,16),score,marker='o')

[Figure: silhouette-coefficient curve for K = 2..15]

2. Fit the clustering model with the chosen K and add the fitted labels to the table.

km=KMeans(n_clusters=8)   # refit with the chosen K (the loop variable held K=15)
km.fit(data_norm)
# add the results to the user_info5 table
user_info5['类别']=km.labels_
user_info5.head()

4. Divide user groups according to the model

The preparation above was thorough, so this step is very simple: a single line extracts our target data.

cluster_result=user_info5.loc[:,['买家会员名','类别']]
cluster_result.head()

4. Personalized recommendation based on user clustering results

No further code is attached here; the focus of this case is on the data exploration and cleaning above. The recommendation idea is as follows:
1. Combine the df_order and df_item tables to find the list of products each user has not bought: [user-product (not bought)].
2. Match the [user-product (not bought)] table from step 1 against the user-group table (the groups produced by the clustering model) to obtain [user-product (not bought)-group] data.
3. Define user preference (the number of times a user purchased a product) and build the [user-product-preference] table.
4. Match the [user-product-preference] table from step 3 against the user-group table to obtain [user-product-preference-group] data.
5. For users in the same group, average the preference for each product to obtain [group-product-average preference] data.
6. Join the [user-product (not bought)-group] table from step 2 with the [group-product-average preference] table from step 5 on (product, group) to obtain [user-product (not bought)-group-average preference] data.
7. Sort within each group and take the Top-N as the recommendation list.
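The seven steps above can be sketched end-to-end on toy data. All rows and the column names 买家会员名/标题/类别/偏好 are hypothetical stand-ins for the real tables; preference is simply the purchase count, as defined in step 3.

```python
import pandas as pd

buys = pd.DataFrame({            # user-item purchase records (toy data)
    "买家会员名": ["u1", "u1", "u2", "u3"],
    "标题":      ["t1", "t2", "t1", "t3"],
})
clusters = pd.DataFrame({        # output of the clustering step (toy data)
    "买家会员名": ["u1", "u2", "u3"],
    "类别":      [0, 0, 1],
})

# steps 3-5: user-item preference = purchase count, averaged per group
pref = buys.groupby(["买家会员名", "标题"]).size().reset_index(name="偏好")
pref = pref.merge(clusters, on="买家会员名")
group_pref = pref.groupby(["类别", "标题"], as_index=False)["偏好"].mean()

# steps 1-2: user-item pairs the user has NOT bought, with the user's group
all_pairs = clusters.merge(pd.DataFrame({"标题": buys["标题"].unique()}),
                           how="cross")
bought = buys[["买家会员名", "标题"]].drop_duplicates()
not_buy = all_pairs.merge(bought, on=["买家会员名", "标题"],
                          how="left", indicator=True)
not_buy = not_buy[not_buy["_merge"] == "left_only"].drop(columns="_merge")

# steps 6-7: attach the group's average preference, then Top-N per user
rec = not_buy.merge(group_pref, on=["类别", "标题"], how="inner")
topn = (rec.sort_values("偏好", ascending=False)
           .groupby("买家会员名").head(2))
print(topn[["买家会员名", "标题", "偏好"]])
```

With this toy data only u2 gets a recommendation (t2, liked by fellow group-0 member u1); on the real tables every user with unbought in-group favorites would receive a Top-N list.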

5. Export the final result

Here is the final export snippet:

import os

if not os.path.exists('save_result'):
    os.makedirs('save_result')  # create the folder if it does not exist

topk.to_csv('save_result/Cluster_User_Item_Topn_data.csv',index=False,encoding='GBK')

6. Summary

1. The three most important stages of the entire project are data exploration, data cleaning, and data preprocessing, which take about 80% of the total time. Unclear thinking or a careless operation at any point can easily lead to errors in the final result.
2. Clustering is a distance-based algorithm; before fitting the model, the data must be scaled to eliminate the influence of differing units.


Origin blog.csdn.net/sun91019718/article/details/101323930