Big Data - Recommendation System

1 Development of recommendation system

A recommendation system decides what to show users who open a product without a specific need in mind. Most of today's apps rely on one.
Information discovery has gone through three stages. Starting in the 1990s, portal websites such as Yahoo, Sohu, and Hao123 offered category-based navigation: they gathered links to many webpages on a single page so that users could jump to them easily. In the 2000s, search engines such as Baidu, Google, and Bing let users find the sites they needed through purposeful searches. Since the 2010s, recommendation systems no longer require users to state a clear need; instead they analyze users' historical behavior and proactively recommend things they are likely to be interested in. Typical apps include Kuaishou, Douyin, and Bilibili (Station B).

2 How the recommendation system works

A recommendation system typically combines the following four kinds of recommendation:

  1. Social recommendation: recommendations flow through the user's social relationships, for example friend recommendations and sharing features;
  2. Content-based recommendation: infer the user's interests from what they search for and recommend matching content;
  3. Hotspot-based recommendation: recommend currently trending information to users;
  4. Collaborative-filtering-based recommendation: recommend to a user what similar users like, which broadens the user's horizons.

Applied together, these four kinds of recommendation connect users and products efficiently, increasing user activity and time spent in the app, and thus the product's commercial value. Douyin is currently the most successful example.
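Of the four kinds, hotspot-based recommendation is the simplest to illustrate. The following is a minimal sketch, assuming interaction events arrive as (timestamp, item) pairs; the function name `hot_items`, the event data, and the 60-second window are all hypothetical choices for illustration, not part of any particular product.

```python
from collections import Counter

def hot_items(events, now, window_seconds, top_n):
    """Return the top_n items with the most interactions inside the recent window."""
    recent = [item for ts, item in events if now - ts <= window_seconds]
    return [item for item, _ in Counter(recent).most_common(top_n)]

# Hypothetical (timestamp_in_seconds, item) interaction events
events = [
    (20, "video_a"), (100, "video_a"), (160, "video_b"),
    (170, "video_b"), (175, "video_c"), (178, "video_b"),
]
trending = hot_items(events, now=180, window_seconds=60, top_n=2)
print(trending)  # ['video_b', 'video_c']
```

Older interactions with video_a fall outside the window, so only recent activity counts as "hot".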

3 Overall architecture of the recommendation system

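[Figure: overall architecture with four components: user service, user behavior feedback, data collection, and recommendation algorithm]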

As the figure above shows, the user service is mainly the front-end interface, data collection adopts the Lambda architecture, and the recommendation algorithm covers two stages, recall and ranking.
The Lambda architecture used for data collection in the recommendation system is shown below:

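[Figure: Lambda architecture for data collection. Panels: real-time processing, batch processing, data collection (Flume), view storage database, distributed computing, data source. Components shown: Memcached, Redis, Spark Streaming, Storm, Flink, Kafka, HBase, HDFS, MySQL, Oracle, MapReduce, Spark]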

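In the batch layer, data is immutable, any computation can be run over it, and the layer scales horizontally; timeliness requirements are low, however, so a delay of a few minutes or even a few hours is acceptable.
The real-time layer needs low latency (ideally within seconds) while computing continuously.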
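The division of labor between the two layers can be sketched in plain Python. This is only an in-memory illustration of the idea, not a deployment of Flume, Kafka, or Spark; the names `batch_view`, `SpeedLayer`, and `serve` are hypothetical.

```python
from collections import Counter

def batch_view(history):
    """Batch layer: recompute click counts from the full, immutable history."""
    return Counter(item for _, item in history)

class SpeedLayer:
    """Real-time layer: incremental counts for events the batch run has not seen yet."""
    def __init__(self):
        self.counts = Counter()

    def on_event(self, item):
        self.counts[item] += 1

def serve(batch, speed, item):
    """Serving side: answer a query by merging the batch view and the real-time view."""
    return batch[item] + speed.counts[item]

history = [("u1", "A"), ("u2", "A"), ("u1", "B")]  # (user, clicked item)
batch = batch_view(history)       # slow, complete, refreshed periodically
speed = SpeedLayer()
speed.on_event("A")               # fresh events arriving after the batch run
speed.on_event("C")
print(serve(batch, speed, "A"))   # 3
```

The batch view can be thrown away and rebuilt at any time because the history is never modified; the real-time layer only has to cover the gap since the last batch run.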
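The recommendation algorithm architecture can be divided into three main stages: recall, ranking, and strategy adjustment. Its main architecture is shown below: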

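[Figure: recommendation pipeline. Massive items → recall → candidate set → ranking → ranked list → rules → top N → displayed on the page]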
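The flow from massive items to the displayed top N can be sketched as three small functions. This is a toy illustration under stated assumptions: recall is a simple tag match, the ranking score is precomputed rather than produced by a model, and the rule is a block list. All names and data are hypothetical.

```python
def recall(items, user_tags, limit):
    """Recall: cheaply narrow the full catalog down to a candidate set."""
    candidates = [it for it in items if it["tag"] in user_tags]
    return candidates[:limit]

def rank(candidates):
    """Ranking: order the candidates by score (precomputed here for simplicity)."""
    return sorted(candidates, key=lambda it: it["score"], reverse=True)

def apply_rules(ranked, blocked_ids):
    """Rules: business adjustments, here just dropping blocked items."""
    return [it for it in ranked if it["id"] not in blocked_ids]

def recommend(items, user_tags, blocked_ids, top_n, recall_limit=100):
    ranked = rank(recall(items, user_tags, recall_limit))
    return [it["id"] for it in apply_rules(ranked, blocked_ids)][:top_n]

catalog = [
    {"id": 1, "tag": "sports", "score": 0.2},
    {"id": 2, "tag": "music", "score": 0.9},
    {"id": 3, "tag": "sports", "score": 0.7},
    {"id": 4, "tag": "news", "score": 0.8},
    {"id": 5, "tag": "music", "score": 0.5},
]
top = recommend(catalog, user_tags={"sports", "music"}, blocked_ids={2}, top_n=2)
print(top)  # [3, 5]
```

Item 4 never reaches ranking because recall drops it, and item 2 ranks first but is removed by the rule stage, which is why the stages are kept separate.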

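4 Recommendation algorithm

The recommendation algorithm part is very similar to the machine learning process in business data analysis and consists of the following steps:

  1. Data processing
  2. Feature engineering
  3. Model training
  4. Producing and evaluating recommendation results

The first two steps are not covered in detail here, since they work the same way as in machine-learning data analysis. The model part differs; this article mainly uses collaborative filtering.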

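Collaborative filtering algorithm

The core idea of this algorithm is that birds of a feather flock together. It generally rests on one of two assumptions:

  • User-based collaborative filtering: you are likely to enjoy what users with the same tastes as you enjoy.
  • Item-based collaborative filtering: you are likely to enjoy items similar to the ones you already like.

Collaborative filtering consists of two main steps:

  1. Find the top N most similar users or items, usually by computing pairwise similarities and sorting by them.
  2. Recommend based on those similar users or items: take the candidates from the top N neighbors and filter out anything the user already has or has explicitly disliked; what remains is the final result.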

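The steps of user-based collaborative filtering:
[Figure: user-based collaborative filtering steps]
The steps of item-based collaborative filtering:
[Figure: item-based collaborative filtering steps]
The similarity in the figures above is computed as follows (taking user-based collaborative filtering as an example):
Number of items User 1 and User 2 have in common: 2
Number of User 1's items: 3
Number of User 2's items: 3
Similarity: 2/3 × 2/3 = 4/9 ≈ 0.44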
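The arithmetic above can be reproduced with Python sets. This is a minimal sketch; the function name `overlap_similarity` and the concrete item names are hypothetical, chosen only so that the two users share 2 of their 3 items each.

```python
def overlap_similarity(items_a, items_b):
    """Product of the shared-item fractions: (common / |A|) * (common / |B|)."""
    if not items_a or not items_b:
        return 0.0
    common = len(items_a & items_b)
    return (common / len(items_a)) * (common / len(items_b))

user1 = {"Item A", "Item B", "Item C"}
user2 = {"Item B", "Item C", "Item D"}
similarity = overlap_similarity(user1, user2)
print(round(similarity, 2))  # 0.44
```

Note that this measure is not the same as Jaccard similarity: for the same two users Jaccard gives 2/4 = 0.5, because it divides the shared items by the size of the union.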

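Similarity calculation

The main methods for computing similarity are the following four:

  • Euclidean distance
    Euclidean distance formula: $E = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2}$
    A distance is unbounded and grows as the two vectors become less alike, whereas a similarity should lie in a bounded range with larger values meaning more alike, so the distance is converted as follows: $\frac{1}{1+E}$, which lies in $(0, 1]$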

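  • Cosine similarity
    $\cos(p, q) = \frac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2}\,\sqrt{\sum_{i=1}^{n} q_i^2}}$

  • Pearson correlation coefficient
    A variant of cosine similarity in which the vectors are centered first, that is, each vector has its own mean subtracted.

  • Jaccard similarity
    $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$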

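How to choose a similarity measure:

  • for numeric data, use cosine similarity or the Pearson correlation coefficient;
  • for Boolean data, Jaccard similarity is generally used.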
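The four measures can be written in a few lines of NumPy. This is a minimal sketch for illustration; the function names and sample vectors are hypothetical, and the code assumes non-constant, non-zero input vectors (otherwise the cosine and Pearson denominators are zero).

```python
import numpy as np

def euclidean_similarity(p, q):
    """Convert the Euclidean distance E into a similarity 1 / (1 + E) in (0, 1]."""
    e = np.sqrt(np.sum((p - q) ** 2))
    return float(1.0 / (1.0 + e))

def cosine_similarity(p, q):
    """Cosine of the angle between p and q."""
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

def pearson_similarity(p, q):
    """Cosine similarity of the mean-centered vectors."""
    return cosine_similarity(p - p.mean(), q - q.mean())

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union, for 0/1 vectors."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    return float(np.sum(a & b) / np.sum(a | b))

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 4.0, 6.0])
print(euclidean_similarity(p, q))   # 1 / (1 + sqrt(14)), about 0.21
print(cosine_similarity(p, q))      # about 1.0, same direction
print(pearson_similarity(p, q))     # about 1.0, perfectly correlated
print(jaccard_similarity([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # 0.5
```

The example shows why the choice matters: q is just p scaled by 2, so cosine and Pearson report near-perfect similarity while the Euclidean-based score is low.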

Collaborative filtering code

Import modules

import pandas as pd
import numpy as np
from sklearn.metrics import jaccard_score
from sklearn.metrics.pairwise import pairwise_distances
from pprint import pprint

Prepare the data

users = ["User1", "User2", "User3", "User4", "User5"]
items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
# User purchase records: 1 means purchased, 0 means not purchased
datasets = [
 [1,0,1,1,0],
 [1,0,0,1,1],
 [1,0,1,0,0],
 [0,1,0,1,1],
 [1,1,1,0,1],
]
df = pd.DataFrame(datasets, columns=items, index=users)
df


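Similarity between users

# Compute the pairwise Jaccard similarity between users (rows of df).
# The jaccard metric works on Boolean data, so cast explicitly to avoid a conversion warning.
user_similar = 1 - pairwise_distances(df.values.astype(bool), metric='jaccard')
user_similar = pd.DataFrame(user_similar, columns=users, index=users)
print("Pairwise similarity between users:")
print(user_similar)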


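Top 2 most similar users for each user

topN_user = {}
for i in user_similar.index:
    # Take this user's row of similarities and drop the user itself
    df_ = user_similar.loc[i].drop([i])
    # Sort by similarity in descending order
    df_sorted = df_.sort_values(ascending=False)
    # Keep the two most similar users
    top2 = list(df_sorted.index[:2])
    topN_user[i] = top2

pprint(topN_user)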


Filter out purchased items and keep the rest

# Collect the items bought by similar users, then remove what the user already bought
rs_results = {}
for user, sim_users in topN_user.items():
    rs_result = set()
    for sim_user in sim_users:
        # Merge everything the similar users have bought
        rs_result = rs_result.union(set(df.loc[sim_user].replace(0, np.nan).dropna().index))

    # Remove the items the user has already bought
    rs_result -= set(df.loc[user].replace(0, np.nan).dropna().index)

    rs_results[user] = rs_result

pprint(rs_results)
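The result above is an unordered set per user. One possible refinement, which is not part of the original code, is to score each candidate item by the summed similarity of the neighbors who bought it, so that recommendations can be ranked. The sketch below rebuilds the same purchase matrix with NumPy only; `jaccard` and `score_items` are hypothetical helper names.

```python
import numpy as np

users = ["User1", "User2", "User3", "User4", "User5"]
items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
purchases = np.array([
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
], dtype=bool)

def jaccard(a, b):
    """Jaccard similarity of two Boolean purchase vectors."""
    return float(np.sum(a & b) / np.sum(a | b))

def score_items(user_idx, k=2):
    """Score unseen items by the summed similarity of the k nearest users who bought them."""
    sims = [(jaccard(purchases[user_idx], purchases[j]), j)
            for j in range(len(users)) if j != user_idx]
    neighbours = sorted(sims, key=lambda s: s[0], reverse=True)[:k]
    scores = {}
    for sim, j in neighbours:
        # Items the neighbour bought that the target user has not
        for col in np.flatnonzero(purchases[j] & ~purchases[user_idx]):
            scores[items[col]] = scores.get(items[col], 0.0) + sim
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(score_items(0))  # [('Item E', 0.5)]
```

For User1 the two nearest users are User3 and User2; only User2 contributes a new item (Item E), with a score equal to the User1/User2 similarity of 0.5.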


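Item-based similarity

# Compute the pairwise Jaccard similarity between items (columns of df).
# Cast to Boolean explicitly, since the jaccard metric works on Boolean data.
item_similar = 1 - pairwise_distances(df.T.values.astype(bool), metric='jaccard')
item_similar = pd.DataFrame(item_similar, columns=items, index=items)
print("Pairwise similarity between items:")
print(item_similar)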

Top 2 most similar items for each item

topN_items = {}
for i in item_similar.index:
    # Take this item's row of similarities and drop the item itself
    df_ = item_similar.loc[i].drop([i])
    # Sort by similarity in descending order
    df_sorted = df_.sort_values(ascending=False)
    # Keep the two most similar items
    top2 = list(df_sorted.index[:2])
    topN_items[i] = top2

pprint(topN_items)


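Build the recommendation list

it_results = {}
# Build the recommendation results
for user in df.index:  # Iterate over all users
    it_result = set()
    # Go through the items this user has already purchased
    for item in df.loc[user].replace(0, np.nan).dropna().index:
        # Add the top-N most similar items of each purchased item to the initial result
        it_result = it_result.union(topN_items[item])

    # Remove the items the user has already purchased
    it_result -= set(df.loc[user].replace(0, np.nan).dropna().index)
    # Store the result for this user
    it_results[user] = it_result
print("Final recommendation results:")
pprint(it_results)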


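5 Recommendation system evaluation

A good algorithm is a win for all three parties: users are satisfied, the service provider realizes commercial value, and content providers earn revenue. Evaluation data falls into direct and indirect evaluation. Signals such as movie ratings or recommendation counts, which state outright that a user likes the content, are direct evaluation: they are accurate, but they are few and costly to obtain. More often we have to rely on indirect evaluation, such as plays, clicks, purchases, comments, and downloads; these signals are less accurate, but plentiful and cheap.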

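Evaluation metrics

Commonly used evaluation metrics are:
• Accuracy • Trust • Satisfaction • Timeliness • Coverage • Robustness • Diversity • Scalability • Novelty • Business goals • Serendipity • User retention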
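Two of these metrics, accuracy and coverage, are easy to compute offline. The sketch below uses precision and recall as one common way to quantify accuracy; the function names and sample data are hypothetical.

```python
def precision_recall(recommended, relevant):
    """Precision: share of recommendations the user liked. Recall: share of liked items recommended."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def coverage(all_recommendations, catalog):
    """Share of the catalog that appears in at least one user's recommendation list."""
    recommended_items = set().union(*all_recommendations.values())
    return len(recommended_items) / len(catalog)

recs = {"u1": ["A", "B"], "u2": ["B", "C"]}     # hypothetical recommendation lists
catalog = ["A", "B", "C", "D", "E"]
p, r = precision_recall(recs["u1"], relevant=["B", "D", "E"])
print(p, r)                      # 0.5 and about 0.33
print(coverage(recs, catalog))   # 0.6
```

A system can score well on precision while covering only a small part of the catalog, which is why several metrics are tracked together.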

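Evaluation methods

  • Questionnaire survey: high cost
  • Offline evaluation: can assess only a few metrics, and its results deviate from real online performance
  • Online evaluation: gray release (gradual rollout) / A/B testing

In practice, offline and online evaluation are usually combined, with questionnaire surveys conducted periodically.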
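For online evaluation, users have to be split into groups in a stable way. A common approach, sketched here under the assumption that a hash of the user ID is an acceptable source of randomness, is to map each user to a position in [0, 1) and compare it with the traffic share of the new algorithm. The function name `assign_bucket` and the experiment name are hypothetical.

```python
import hashlib

def assign_bucket(user_id, experiment, treatment_share=0.1):
    """Deterministically assign a user to 'treatment' or 'control' for one experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    position = int(digest[:8], 16) / 0x100000000   # value in [0, 1)
    return "treatment" if position < treatment_share else "control"

# The same user always lands in the same bucket for a given experiment
first = assign_bucket("user_42", "new_ranking_model")
second = assign_bucket("user_42", "new_ranking_model")
print(first == second)  # True
```

Starting with a small `treatment_share` and raising it step by step gives a gray release; keeping it fixed and comparing metrics between the two groups gives an A/B test.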

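Recommendation system cold start

Cold start is essentially the problem of predicting user preferences when historical data is missing. It can be divided into user cold start, item cold start, and system cold start.

User cold start: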

That is, how to make personalized recommendations for new users. The usual approach is to collect as much user feature data as possible:

  1. Collect basic user information: gender, age, region, phone model, GPS location, and the list of installed apps
  2. Guide users to state their interests, for example through an interest-selection screen when they first open the app
  3. Use behavioral data from associated apps; for example, Tencent products are linked to QQ and WeChat
  4. Treat new and existing users differently: new users generally get popular items, while existing users get personalized recommendations.
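The last point amounts to a fallback rule. A minimal sketch, assuming a user counts as "new" below some number of recorded events; the threshold `min_events`, the function `recommend_for`, and the stand-in personalized model are all hypothetical.

```python
def recommend_for(user_id, history, personalized, popular, min_events=5, top_n=3):
    """Serve popular items to users with little history, personalized items otherwise."""
    if len(history.get(user_id, [])) < min_events:
        return popular[:top_n]
    return personalized(user_id)[:top_n]

popular = ["hit_1", "hit_2", "hit_3", "hit_4"]
history = {"old_user": ["a", "b", "c", "d", "e", "f"], "new_user": ["a"]}

def personalized(user_id):
    # Stand-in for a real model such as collaborative filtering
    return ["niche_1", "niche_2", "niche_3"]

print(recommend_for("new_user", history, personalized, popular))  # ['hit_1', 'hit_2', 'hit_3']
print(recommend_for("old_user", history, personalized, popular))  # ['niche_1', 'niche_2', 'niche_3']
```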

Item cold start:

That is, how to recommend new items to users.

  1. Tag the items: tags can be generated by the system's own services or crawled from other websites.
  2. Use an item's tags to recommend it to users who have liked similar items.
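The two steps can be sketched with tag sets. This example assumes tag overlap is measured with Jaccard similarity and that a threshold of 0.5 defines "similar"; both choices, along with the function names and data, are hypothetical.

```python
def tag_similarity(tags_a, tags_b):
    """Jaccard similarity between two tag sets."""
    union = tags_a | tags_b
    return len(tags_a & tags_b) / len(union) if union else 0.0

def users_for_new_item(new_item_tags, item_tags, liked, threshold=0.5):
    """Find users who liked at least one existing item whose tags resemble the new item's."""
    similar_items = {item for item, tags in item_tags.items()
                     if tag_similarity(new_item_tags, tags) >= threshold}
    return sorted(user for user, items in liked.items() if similar_items & set(items))

item_tags = {
    "film_1": {"sci-fi", "space"},
    "film_2": {"romance", "drama"},
    "film_3": {"sci-fi", "action", "space"},
}
liked = {"u1": ["film_1"], "u2": ["film_2"], "u3": ["film_3", "film_2"]}
targets = users_for_new_item({"sci-fi", "space"}, item_tags, liked)
print(targets)  # ['u1', 'u3']
```

No interaction data for the new item is needed; its tags alone decide who sees it first.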

System cold start:

User cold start plus item cold start, typically handled in stages:

  1. Early on, rely on content-based recommendation;
  2. Then transition to collaborative filtering;
  3. Finally, combine content-based recommendation and collaborative filtering.
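One way to picture this transition is a blend whose weight shifts from the content-based score to the collaborative filtering score as interaction data accumulates. The linear schedule and the `full_cf_at` threshold below are assumptions made for illustration, not a prescribed method.

```python
def blend_weight(n_interactions, full_cf_at=1000):
    """Weight given to collaborative filtering: 0 with no data, 1 once enough data exists."""
    return min(n_interactions / full_cf_at, 1.0)

def hybrid_score(content_score, cf_score, n_interactions, full_cf_at=1000):
    """Blend the content-based and collaborative filtering scores for one item."""
    w = blend_weight(n_interactions, full_cf_at)
    return (1 - w) * content_score + w * cf_score

print(hybrid_score(0.8, 0.2, n_interactions=0))     # 0.8, content only
print(hybrid_score(0.8, 0.2, n_interactions=500))   # about 0.5, even blend
print(hybrid_score(0.8, 0.2, n_interactions=5000))  # 0.2, collaborative filtering only
```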


Origin blog.csdn.net/gjinc/article/details/132105404