A Survey of Recommender Systems, with Code
By Joey周琦
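Introduction and Notation
In general, a recommender system can be framed as predicting a user's rating of, or click-through rate on, an item. The problem is described as follows.
User-item interactions fall into three main types:
- scalar: numerical (a rating) or ordinal values
- binary: like/dislike, 0 or 1, clicked or not clicked
- unary: a single kind of positive signal, such as a purchase or online access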
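To keep the presentation clear, we fix the following notation:
- item: a product; user: a user (the two naming styles may be mixed below)
- $U$: the set of users
- $I$: the set of items
- $R$: the set of ratings (or click-through rates, etc.) that users gave to items
- $S$: the set of possible rating values (e.g., $S = [1, \dots, 5]$ or $S = \{\text{like}, \text{dislike}\}$)
- $r_{ui}$: the rating user $u$ gave to item $i$
- $U_i$: the set of users who rated item $i$
- $I_u$: the set of items rated by user $u$
- $I_{uv} = I_u \cap I_v$, $U_{ij} = U_i \cap U_j$
- $N(u)$: the $K$ nearest neighbors (KNN) of user $u$
- $N_i(u)$: the $K$ nearest neighbors of user $u$ among the users who rated item $i$
- $N_u(i)$: the $K$ nearest neighbors of item $i$ among the items rated by user $u$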
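Expressed with a utility function $f: U \times I \to S$, where $U$ is the user space and $I$ is the item space, the task is to estimate $f(u, i)$ for user-item pairs that have not been observed.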
Explicit feedback from users comes in the following three forms. This kind of data is relatively scarce but relatively accurate.
* like/dislike
* ratings
* text comments
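Implicit feedback includes actions such as saving, clicking, abandoning, printing, and tagging. These actions do not require the user's deliberate participation. This kind of data is more plentiful but less accurate than explicit feedback.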
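Recommender systems are generally divided into two classes, plus hybrids of the two:
- Content-based recommendations (CB): recommend items whose content is similar to the items in the user's browsing history.
- Collaborative filtering recommendations (CF): recommend to a user the items viewed by similar users (user-CF), or items similar to those the user has viewed (item-CF). Similarity here does not come from content analysis: two users are more similar the more items they have both viewed, and two items are more similar the more users have viewed both.
- Hybrids of the above methods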
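Content-based recommender systems
A content-based recommender (CB) analyzes a user's past ratings, clicks, and other behavior to build a profile or model for each user. The profile is a structured representation of the user's interests and can be used to generate new recommendations. Content-based filtering needs techniques (natural language processing, information extraction, e.g. TF-IDF) to represent each item and each user profile, and a strategy for comparing a user profile with an item.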
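The content-based utility function can be defined as
$$f(u, i) = \mathrm{score}(\mathrm{ContentBasedProfile}(u), \mathrm{Content}(i))$$
- $\mathrm{Content}(i)$ is the profile of item $i$, and $\mathrm{ContentBasedProfile}(u)$ is the profile of user $u$.
- $\mathrm{Content}(i)$ and $\mathrm{ContentBasedProfile}(u)$ can be represented as TF-IDF vectors (or with other techniques).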
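As a rough illustration of scoring ContentBasedProfile(u) against Content(i), the sketch below assumes two things that the text does not specify: the user profile is the mean of the vectors of the items the user liked, and the score is cosine similarity. The item vectors, the three term dimensions, and the helper names `build_profile` and `cosine` are all made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_profile(item_vectors, liked_items):
    """Hypothetical profile: the mean of the vectors of the liked items."""
    dim = len(next(iter(item_vectors.values())))
    profile = [0.0] * dim
    for item in liked_items:
        for k, x in enumerate(item_vectors[item]):
            profile[k] += x / len(liked_items)
    return profile

# Toy item vectors over three invented terms: ("space", "romance", "robot").
item_vectors = {
    "A": [1.0, 0.0, 1.0],
    "B": [0.9, 0.1, 0.8],
    "C": [0.0, 1.0, 0.0],
}

profile = build_profile(item_vectors, liked_items=["A"])
# Score every item the user has not interacted with, then rank by score.
scores = {i: cosine(profile, v) for i, v in item_vectors.items() if i != "A"}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Averaging is only one possible profile; a trained classifier over the same vectors would be another.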
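Many machine learning methods, such as naive Bayes, neural networks, and decision trees, can be applied to content-based recommendation. (My own understanding is that the vector representations serve as features and the clicks or ratings serve as labels for training a classifier?)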
[Figure: the content-based recommendation framework]
CB has the following advantages:
- User independence: content-based recommenders use only the active user's own ratings and history to build that user's profile
- Strong explainability
- CB can recommend items that have no interaction history yet, because it works from content analysis
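Some limitations of CB:
- Limited content analysis: techniques for analyzing audio and image content are limited
- Over-specialization: the system only recommends items related to the content of the user's past behavior, so recommendations lack novelty
- New users are hard to serve (because they have no history)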
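Item Representation
In most CB systems, the item description consists of textual features extracted from web pages, articles, reviews, and other content. Because natural language is ambiguous, textual features raise some hard problems:
- POLYSEMY, the presence of multiple meanings for one word;
- SYNONYMY, multiple words with the same meaning.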
Keyword-based vector space model
Most content-based recommender systems use relatively simple retrieval models, such as keyword matching or the Vector Space Model (VSM) with basic TF-IDF weighting. Let $D = \{d_1, \dots, d_N\}$ denote the corpus of documents and $T = \{t_1, \dots, t_n\}$ the dictionary of terms, obtained by applying some standard natural language processing operations, such as tokenization, stopword removal, and stemming. Each item (document) $d_j$ is represented as a vector $d_j = (w_{1j}, \dots, w_{nj})$, where $w_{kj}$ is the weight of term $t_k$ in $d_j$. TF-IDF weighting rests on three assumptions:
- rare terms are not less relevant than frequent terms (IDF assumption);
- multiple occurrences of a term in a document are not less relevant than single occurrences (TF assumption);
- long documents are not preferred to short documents (normalization assumption).
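TF-IDF is calculated as follows:
$$\mathrm{TF\text{-}IDF}(t_k, d_j) = \mathrm{TF}(t_k, d_j) \cdot \log\frac{N}{n_k}$$
The second term is IDF, where $N$ is the number of documents in the corpus and $n_k$ is the number of documents in which term $t_k$ occurs at least once. The first term, TF, can be calculated as follows
$$\mathrm{TF}(t_k, d_j) = \frac{f_{k,j}}{\max_z f_{z,j}}$$
where $f_{k,j}$ is the number of occurrences of $t_k$ in $d_j$ and the maximum is taken over all terms $t_z$ that occur in $d_j$. The weights are usually cosine-normalized so that documents of different lengths are comparable:
$$w_{kj} = \frac{\mathrm{TF\text{-}IDF}(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|} \mathrm{TF\text{-}IDF}(t_s, d_j)^2}}$$
And the cosine similarity is the most widely used:
$$\mathrm{sim}(d_i, d_j) = \frac{\sum_k w_{ki}\, w_{kj}}{\sqrt{\sum_k w_{ki}^2}\,\sqrt{\sum_k w_{kj}^2}}$$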
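The TF-IDF and cosine steps can be sketched in a few lines of plain Python. This is a minimal sketch, assuming the common variant in which TF is the term count divided by the largest count in the document, IDF is log(N / n_k), and each document vector is scaled to unit length. The toy corpus and the function names are invented for the example, and documents are assumed to be non-empty and already tokenized.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each non-empty tokenized document to a unit-length TF-IDF dict."""
    n_docs = len(docs)
    # n_k: number of documents that contain term k at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values())
        raw = {
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as term -> weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    ["robot", "space", "robot"],
    ["robot", "space", "war"],
    ["love", "story", "love"],
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # documents that share terms
sim_02 = cosine(vecs[0], vecs[2])  # documents with no term in common
print(sim_01, sim_02)
```

Note that a term occurring in every document gets IDF log(1) = 0 under this variant, so it contributes nothing to the similarity.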
General Content based approach
In some way, we can find a vector to represent the item i by
Methods for learning user profiles
Probabilistic Methods and Naive Bayes
Relevance Feedback and Rocchio’s Algorithm
Collaborative methods
Algorithms for collaborative recommendation can be grouped into two general classes: memory-based (or heuristic-based) and model-based.
Memory-based / neighborhood approaches
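User-based rating
The value of the unknown rating $r_{ui}$ of user $u$ for item $i$ is usually computed as an aggregate of the ratings given to $i$ by the nearest neighbors of $u$:
$$\hat{r}_{ui} = \frac{1}{|N_i(u)|} \sum_{v \in N_i(u)} r_{vi} \tag{a}$$
$$\hat{r}_{ui} = \frac{\sum_{v \in N_i(u)} w_{uv}\, r_{vi}}{\sum_{v \in N_i(u)} |w_{uv}|} \tag{b}$$
$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N_i(u)} w_{uv}\, (r_{vi} - \bar{r}_v)}{\sum_{v \in N_i(u)} |w_{uv}|} \tag{c}$$
where $w_{uv}$ is the similarity between users $u$ and $v$, and $\bar{r}_u$ is the mean rating given by user $u$.
Eq. (c) accounts for the fact that different users use different rating scales (some tend to give high ratings, others low ones). Z-score normalization goes one step further and also accounts for the variance of each user's ratings.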
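To make the rating-scale correction concrete, here is a small sketch that assumes Eq. (c) is the usual mean-centered weighted average: each neighbor's rating is taken relative to that neighbor's own mean, weighted by similarity, and added to the target user's mean. The ratings, the similarity weights in `sim`, and the function names are hypothetical.

```python
def mean_rating(ratings, user):
    """Average of all ratings given by one user."""
    values = ratings[user].values()
    return sum(values) / len(values)

def predict_mean_centered(ratings, sim, u, i, k=2):
    """Predict r_ui from the k most similar users who rated item i."""
    candidates = [v for v in ratings if v != u and i in ratings[v]]
    neighbors = sorted(candidates, key=lambda v: sim[u][v], reverse=True)[:k]
    num = sum(
        sim[u][v] * (ratings[v][i] - mean_rating(ratings, v)) for v in neighbors
    )
    den = sum(abs(sim[u][v]) for v in neighbors)
    base = mean_rating(ratings, u)
    # Fall back to the user's own mean when no usable neighbor exists.
    return base + num / den if den else base

# Toy data: "v" rates generously, "w" rates harshly.
ratings = {
    "u": {"a": 4, "b": 2},
    "v": {"a": 5, "b": 3, "c": 5},
    "w": {"a": 1, "b": 2, "c": 1},
}
sim = {"u": {"v": 0.9, "w": 0.1}}  # assumed precomputed similarity weights

pred = predict_mean_centered(ratings, sim, "u", "c")
print(round(pred, 3))
```

Here the prediction lands above the mean of "u" because the most similar neighbor rated item "c" above their own average.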
User-based classification
If the rating can take many values (for example, on a continuous scale), regression is more appropriate. If the rating takes only a few discrete values, classification may be preferable.
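For the classification case, one common approach (assumed here, since the text does not spell it out) is a similarity-weighted vote: each neighbor votes for the rating value it gave, with a weight equal to its similarity to the target user, and the value with the largest total wins. The neighbor names and weights below are invented.

```python
def classify_by_vote(neighbor_ratings, weights, rating_values):
    """Return the rating value with the largest similarity-weighted vote."""
    votes = {r: 0.0 for r in rating_values}
    for neighbor, rating in neighbor_ratings.items():
        votes[rating] += weights[neighbor]
    # Ties are resolved in favor of the value listed first in rating_values.
    best = max(rating_values, key=lambda r: votes[r])
    return best, votes

neighbor_ratings = {"v": "like", "w": "dislike", "x": "like"}
weights = {"v": 0.2, "w": 0.7, "x": 0.3}

label, votes = classify_by_vote(neighbor_ratings, weights, ["like", "dislike"])
print(label, votes)
```

Two neighbors say "like", but the single "dislike" neighbor is much more similar, so the weighted vote and a plain majority vote disagree on this toy input.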
Item-based rating
User-based vs. item-based
- number of users << number of items -> user-based
- number of users >> number of items -> item-based
- user-based is more likely to make serendipitous recommendations
- justifiability: item-based > user-based
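Similarity computation
Cosine vector (CV) similarity:
$$\mathrm{CV}(u, v) = \frac{\sum_{i \in I_{uv}} r_{ui}\, r_{vi}}{\sqrt{\sum_{i \in I_u} r_{ui}^2 \, \sum_{j \in I_v} r_{vj}^2}}$$
Pearson Correlation (PC) similarity:
$$\mathrm{PC}(u, v) = \frac{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)^2 \, \sum_{i \in I_{uv}} (r_{vi} - \bar{r}_v)^2}}$$
For items similarity:
$$\mathrm{PC}(i, j) = \frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}{\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)^2 \, \sum_{u \in U_{ij}} (r_{uj} - \bar{r}_j)^2}}$$
Adjusted cosine (AC) similarity, which centers on the user mean $\bar{r}_u$ instead of the item mean:
$$\mathrm{AC}(i, j) = \frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_u)(r_{uj} - \bar{r}_u)}{\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_u)^2 \, \sum_{u \in U_{ij}} (r_{uj} - \bar{r}_u)^2}}$$
Other similarity measures: Mean Squared Difference (MSD) and Spearman Rank Correlation (SRC), which considers the ranking of the ratings instead of their values.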
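The two correlation-style measures can be sketched as follows. The code assumes the textbook definitions: Pearson correlation between two users over the items both rated, and adjusted cosine between two items over the users who rated both, centered on each user's mean. One further assumption is that every mean is taken over all of that user's ratings, not only the co-rated ones. The ratings are toy data.

```python
import math

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def pearson_users(ratings, u, v):
    """Pearson correlation of users u and v over their co-rated items."""
    common = ratings[u].keys() & ratings[v].keys()
    mu_u, mu_v = mean(ratings[u].values()), mean(ratings[v].values())
    num = sum((ratings[u][i] - mu_u) * (ratings[v][i] - mu_v) for i in common)
    den = math.sqrt(
        sum((ratings[u][i] - mu_u) ** 2 for i in common)
        * sum((ratings[v][i] - mu_v) ** 2 for i in common)
    )
    return num / den if den else 0.0

def adjusted_cosine_items(ratings, i, j):
    """Adjusted cosine of items i and j, centering on each user's mean."""
    num = den_i = den_j = 0.0
    for user, user_ratings in ratings.items():
        if i in user_ratings and j in user_ratings:
            mu = mean(user_ratings.values())
            d_i, d_j = user_ratings[i] - mu, user_ratings[j] - mu
            num += d_i * d_j
            den_i += d_i * d_i
            den_j += d_j * d_j
    den = math.sqrt(den_i * den_j)
    return num / den if den else 0.0

ratings = {
    "u": {"a": 5, "b": 3, "c": 1},
    "v": {"a": 4, "b": 3, "c": 2},
    "w": {"a": 1, "b": 3, "c": 5},
}
pc_uv = pearson_users(ratings, "u", "v")          # same taste, different spread
pc_uw = pearson_users(ratings, "u", "w")          # opposite taste
ac_ac = adjusted_cosine_items(ratings, "a", "c")  # items rated in opposition
print(pc_uv, pc_uw, ac_ac)
```

Both functions return 0.0 when the denominator vanishes, which is a convention chosen for this sketch.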
Neighborhood selection
In large recommender systems that can have millions of users and items, it is usually
not possible to store the (non-zero) similarities between each pair of users or items,
due to memory limitations. Moreover, doing so would be extremely wasteful as only
the most significant of these values are used in the predictions. The pre-filtering of neighbors is an essential step that makes neighborhood-based approaches practicable
by reducing the amount of similarity weights to store, and limiting the number of
candidate neighbors to consider in the predictions. Several ways are listed as follows.
- top-N filtering
- threshold filtering
- Negative filtering
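The three pre-filtering strategies are simple to express over a dict of similarity weights. This is a sketch under assumptions: top-N keeps the N largest weights, threshold filtering keeps weights whose magnitude is at least `w_min`, and negative filtering drops negative weights. The neighbor names and weights are hypothetical.

```python
def top_n_filter(sims, n):
    """Keep only the n neighbors with the largest similarity weight."""
    ranked = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])

def threshold_filter(sims, w_min):
    """Keep neighbors whose weight magnitude is at least w_min."""
    return {v: w for v, w in sims.items() if abs(w) >= w_min}

def negative_filter(sims):
    """Drop neighbors with a negative similarity weight."""
    return {v: w for v, w in sims.items() if w >= 0}

# Similarity weights of one user to four candidate neighbors (toy values).
sims = {"v": 0.9, "w": -0.4, "x": 0.2, "y": 0.05}

print(top_n_filter(sims, 2))
print(threshold_filter(sims, 0.3))
print(negative_filter(sims))
```

The filters compose, for example negative filtering followed by top-N, which is one way to bound both storage and the candidate set.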
After pre-filtering, we can choose kNN from the filtered neighborhoods. The choice of
Advantages and disadvantages
Advantages compared to model-based methods:
- simplicity
- justifiability
- efficiency: the nearest neighbors can be pre-computed in an offline step and provide near instantaneous recommendations
- stability
- bring some ‘serendipity’ (novelty)
Model based
Neighborhood methods have the following two important flaws:
- Users can be neighbors only if they have rated common items. This assumption is very limiting, as users having rated few or no common items may still have similar preferences.
- The accuracy of neighborhood-based recommendation methods suffers from the lack of available ratings. Sparsity is a problem common to most recommender systems, because users typically rate only a small proportion of the available items. (Sometimes CB can be used to fill in some ratings, but this is limited.)
Model-based methods use the collection of ratings to learn a model, which is then used to make rating predictions. For example
rating values are integers between 0 and
The limitations of model-based methods are:
- New user problem
- New item problem
- Sparsity
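Code experiments
Predicting whether a user will watch a movie
One view in recommender-system practice is that an item should be recommended to a user not because of how high or low the user's predicted rating is, but because of how likely the user is to rate or otherwise interact with the item. Below we run experiments on the small MovieLens dataset, with similarity computed under each of the following criteria:
1. If user u has rated item i, then
- Parameter choice (number of neighbors K=7)
- Aggregation function choice
- Recall, precision: (0.11920293092663747, 0.20491551459293394)
2. If user u has rated item i, then
- Parameter choice (number of neighbors K=7)
- Aggregation function choice
- Recall, precision: (0.11973907604324904, 0.20583717357910905)
3. Recommending only the items with high predicted ratings performs poorly
- Aggregation function choice
- Recall, precision: (0.015190778303994281, 0.026113671274961597)
Conclusion: a high predicted rating does not mean the user will actually interact with the movie; predicting whether the user will watch a movie is the more reasonable objective (at least for this MovieLens dataset).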
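For reference, recall and precision for top-N recommendation lists are commonly computed as below. This is a sketch of the usual definitions, not necessarily the exact evaluation used for the numbers above: hits are recommended items that appear in a user's held-out test set, recall divides hits by the total number of held-out items, and precision divides hits by the total number of recommendations. The function name and the toy data are hypothetical, and the result is returned in the same (recall, precision) order as the tuples reported above.

```python
def recall_precision(recommended, held_out):
    """Return (recall, precision) pooled over all users in the test set."""
    hits = n_recommended = n_held_out = 0
    for user, test_items in held_out.items():
        recs = recommended.get(user, [])
        hits += len(set(recs) & test_items)
        n_recommended += len(recs)
        n_held_out += len(test_items)
    recall = hits / n_held_out if n_held_out else 0.0
    precision = hits / n_recommended if n_recommended else 0.0
    return recall, precision

# Top-3 recommendations per user and the items each user really interacted
# with in the held-out split (toy values).
recommended = {"u": ["a", "b", "c"], "v": ["d", "e", "f"]}
held_out = {"u": {"a"}, "v": {"d", "e", "y", "z"}}

result = recall_precision(recommended, held_out)
print(result)
```

With a fixed list length N, precision and recall move together across algorithms, which is why the three settings above can be compared on either number.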
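Predicting a user's rating of a movie
The experiment above predicts whether a user will watch a movie; another family of algorithms predicts the rating (degree of liking) a user would give a movie. Collaborative filtering can solve this kind of problem, and it comes in two main forms:
- Neighborhood-based collaborative filtering: user-CF, item-CF
- Model-based collaborative filtering: SVD, etc.
This post tested user-CF and SVD on the MovieLens data. The RMSE of the predicted ratings is 1.007 for user-CF and 0.97 for SVD, so SVD predicts better on this dataset.
Data and code download link: http://download.csdn.net/detail/u011467621/9133001
Later updates to the code are at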
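To show what the RMSE metric measures and what an "SVD"-style model can look like, here is a small sketch. It assumes a Funk-style matrix factorization trained with stochastic gradient descent, which is one common reading of "SVD" in recommender systems and may differ from the implementation behind the numbers above. The hyperparameters, the function names, and the six toy ratings are made up, and the fitted error is measured on the training data only.

```python
import math
import random

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

def train_mf(triples, n_factors=2, lr=0.02, reg=0.02, epochs=300, seed=0):
    """Fit r_ui ~ mu + p_u . q_i by SGD and return a predict(u, i) function."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in triples) / len(triples)
    users = sorted({u for u, _, _ in triples})
    items = sorted({i for _, i, _ in triples})
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(epochs):
        for u, i, r in triples:
            err = r - predict(u, i)
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                # Gradient step on squared error with L2 regularization.
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return predict

triples = [
    ("u", "a", 5), ("u", "b", 1),
    ("v", "a", 4), ("v", "b", 2),
    ("w", "a", 1), ("w", "b", 5),
]
actual = [r for _, _, r in triples]
global_mean = sum(actual) / len(actual)

baseline = rmse([global_mean] * len(actual), actual)  # always predict the mean
predict = train_mf(triples)
fitted = rmse([predict(u, i) for u, i, _ in triples], actual)
print(baseline, fitted)
```

A real comparison would compute RMSE on a held-out split, since training error alone says little about generalization.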
https://github.com/joeyqzhou/recommendation-system
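References
- Ricci, Francesco, Lior Rokach, and Bracha Shapira. Introduction to Recommender Systems Handbook. Springer US, 2011.
- 项亮. 推荐系统实践 [M]. 人民邮电出版社, 2012.
Note: translations will be added later when time allows.
My own expertise is limited; feel free to email [email protected] or leave a comment to discuss.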