A Survey of Recommender Systems, with Code

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/u011467621/article/details/48624973

By Joey周琦

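Introduction and Notation

Broadly speaking, a recommender system boils down to predicting the rating or click-through rate of a user on an item. The problem can be described as follows.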

User-item interactions fall into three main types:

  • scalar: numerical (ratings) or ordinal values
  • binary: like/dislike, 0 or 1, click or no click, etc.
  • unary: a single observed action, such as a purchase or online access

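For clarity, we use the following notation:

  • item: a product; user: a customer (the terms "item" and "product" are used interchangeably below)
  • $U$: the set of users
  • $I$: the set of items
  • $R$: the set of ratings (or click-through rates, etc.) given by users to items
  • $S$: the set of possible rating values (e.g., $S = \{1, \dots, 5\}$ or $S = \{\text{like}, \text{dislike}\}$)
  • $r_{ui}$: the rating given by user $u$ to item $i$
  • $U_i$: the set of users who have rated item $i$
  • $I_u$: the set of items rated by user $u$
  • $I_{uv} = I_u \cap I_v$, $U_{ij} = U_i \cap U_j$
  • $N(u)$: the $K$ nearest neighbors (KNN) of user $u$
  • $N_i(u)$: the $K$ nearest neighbors of user $u$ among the users who have rated item $i$
  • $N_u(i)$: the $K$ nearest neighbors of item $i$ among the items rated by user $u$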

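If a utility function $f$ measures the interest of user $u$ in item $i$, i.e., $f: U \times I \to S$, then for each user $u \in U$ we want to choose the item $i \in I$ that maximizes the user's interest:

$$\forall u \in U, \quad i_u = \arg\max_{i \in I} f(u, i).$$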
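As a small illustration of the argmax rule, here is a sketch in Python. The utility table `f`, the user and item names, and all the numbers are invented for the example; a real system would have to estimate these values.

```python
# Hypothetical utility values f(u, i); users, items and numbers are invented.
f = {
    "alice": {"item_a": 3.5, "item_b": 4.8, "item_c": 1.0},
    "bob": {"item_a": 2.0, "item_b": 1.5, "item_c": 4.1},
}

def best_item(utilities):
    """Return the item with the highest utility for one user."""
    return max(utilities, key=utilities.get)

# i_u = argmax over items of f(u, i), computed for every user u
recommendations = {user: best_item(items) for user, items in f.items()}
print(recommendations)  # {'alice': 'item_b', 'bob': 'item_c'}
```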

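Each user in the user space $U$ can be represented by a profile, which may include basic attributes (gender, age, region, education) and historical behavior (user ID).
Each item in the item space $I$ can be represented by its content (title, category, etc.) and historical behavior (item ID).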

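Explicit feedback from users comes in the following three forms. Such information is scarce but relatively accurate.
* like/dislike
* ratings
* text comments

Implicit feedback includes actions such as saving, clicking, discarding, printing, and tagging. These actions do not require the user's deliberate involvement. Such information is plentiful but less accurate than explicit feedback.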

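Recommender systems are generally divided into two classes, plus hybrids of the two:

  • Content-based recommendation (CB): recommends items whose content is similar to that of the items in the user's browsing history.
  • Collaborative filtering recommendation (CF): recommends items seen by similar users (user-CF), or items similar to those the user has seen (item-CF). Similarity here is not derived from content: two users are more similar the more items they have both seen, and two items are more similar the more users have seen both.
  • Hybrid methods that combine the two approaches above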

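Content-based recommender systems

A content-based (CB) recommender system analyzes a user's past ratings, clicks, and other behavior to build a profile or model for each user. The profile is a structured representation of the user's interests and can be used to produce new recommendations. Content-based filtering needs techniques (natural language processing, information extraction, e.g., TF-IDF) to represent each item and each user profile, and a strategy for comparing a user profile with an item.

The content-based utility function $f(u, i)$ can be defined as:

$$f(u, i) = \mathrm{score}(\mathrm{ContentBasedProfile}(u), \mathrm{Content}(i))$$

  • $\mathrm{Content}(i)$ is the profile of item $i$; $\mathrm{ContentBasedProfile}(u)$ is the profile of user $u$.
  • Both can be represented as TF-IDF vectors (or by other techniques).

Many machine learning methods, such as naive Bayes, neural networks, and decision trees, can be applied to content-based recommendation. (My understanding, which may be wrong, is that the vector representations serve as features and the clicks or ratings as labels for training a classifier.)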

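The content-based recommendation framework is shown in the figure below.

[Figure: the content-based recommendation framework]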

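CB has the following advantages:

  • User independence: content-based recommenders rely solely on the active user's own ratings and behavior history to build that user's profile.
  • Transparency: recommendations are easy to explain.
  • New items: CB can recommend items that have no interaction history yet, because they can be analyzed by content.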

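CB has some limitations:

  • Limited content analysis: techniques for analyzing audio and image content are limited.
  • Over-specialization: the system only recommends items related to the content of the user's history, so recommendations lack novelty.
  • New users: recommending to new users is difficult because they have no behavior history.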

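Item Representation

In most CB systems, item descriptions are textual features extracted from web pages, articles, reviews, and similar content. Because natural language is ambiguous, textual features raise some tricky problems:

  • Polysemy: one word has multiple meanings.
  • Synonymy: multiple words share the same meaning.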

Keyword-based vector space model

Most content-based recommender systems use relatively simple retrieval models, such as keyword matching or the Vector Space Model (VSM) with basic TF-IDF weighting. Let $D = \{d_1, d_2, \dots, d_N\}$ denote a set of documents (the corpus), and $T = \{t_1, t_2, \dots, t_n\}$ be the dictionary, that is, the set of words in the corpus. $T$ is obtained by applying some standard natural language processing operations, such as tokenization, stopword removal, and stemming. Each document $d_j$ is represented as a vector in an $n$-dimensional vector space, $d_j = (w_{1j}, w_{2j}, \dots, w_{nj})$, where $w_{kj}$ is the weight of term $t_k$ in document $d_j$. TF-IDF weighting is based on the following assumptions:

  • rare terms are not less relevant than frequent terms (IDF assumption);
  • multiple occurrences of a term in a document are not less relevant than single
    occurrences (TF assumption);
  • long documents are not preferred to short documents (normalization assumption).

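TF-IDF is calculated as follows:

$$\mathrm{TFIDF}(t_k, d_j) = \mathrm{TF}(t_k, d_j) \cdot \log\frac{N}{n_k}$$

The second factor is the IDF, where $n_k$ is the number of documents in which term $t_k$ occurs at least once. The first factor, TF, is calculated as

$$\mathrm{TF}(t_k, d_j) = \frac{f_{k,j}}{\max_z f_{z,j}}$$

where $f_{k,j}$ is the number of times term $t_k$ occurs in document $d_j$, and the maximum is taken over all terms $t_z$ that occur in $d_j$. Cosine normalization then gives the weights

$$w_{kj} = \frac{\mathrm{TFIDF}(t_k, d_j)}{\sqrt{\sum_{s=1}^{n} \mathrm{TFIDF}(t_s, d_j)^2}}$$

Cosine similarity is the most widely used similarity measure between two documents:

$$\mathrm{sim}(d_i, d_j) = \frac{\sum_k w_{ki} w_{kj}}{\sqrt{\sum_k w_{ki}^2} \sqrt{\sum_k w_{kj}^2}}$$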
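The TF-IDF weighting and cosine similarity described here can be sketched in plain Python as follows. This is only an illustrative sketch: the three-document toy corpus is invented, the documents are assumed to be tokenized already, and the function names `tfidf_vectors` and `cosine` are my own.

```python
import math
from collections import Counter

# Toy corpus, already tokenized; the documents are invented for illustration.
docs = [
    ["space", "war", "robot", "space"],
    ["love", "story", "war"],
    ["robot", "love", "robot", "story"],
]

def tfidf_vectors(docs):
    """Return the vocabulary and one cosine-normalized TF-IDF vector per document."""
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # n_k: number of documents that contain term t_k
    doc_freq = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        counts = Counter(d)
        max_count = max(counts.values())
        # TF (count / max count in the document) times IDF (log N / n_k)
        raw = [(counts[t] / max_count) * math.log(n_docs / doc_freq[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / norm if norm else 0.0 for x in raw])
    return vocab, vectors

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

vocab, vectors = tfidf_vectors(docs)
print(round(cosine(vectors[1], vectors[2]), 3))  # documents 2 and 3 share "love" and "story"
print(round(cosine(vectors[0], vectors[1]), 3))  # documents 1 and 2 share only "war"
```

In this toy corpus the second and third documents come out more similar to each other than the first and second, since they share two terms instead of one.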

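General content-based approach

Suppose each item $i$ can be represented in some way by a vector $x_i$. The user profile vector $x_u$ can then be built as the rating-weighted sum of the vectors of the items the user has rated:

$$x_u = \sum_{i \in I_u} r_{ui} x_i$$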
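A minimal sketch of this profile construction, followed by ranking unseen items by cosine similarity to the profile. The item vectors, the ratings, and the choice of cosine as the `score` function are assumptions made for the example.

```python
import math

# Hypothetical item vectors x_i (e.g. TF-IDF weights over three terms) and
# one user's ratings r_ui; all values are invented.
item_vectors = {
    "m1": [1.0, 0.0, 0.0],
    "m2": [0.0, 1.0, 0.0],
    "m3": [0.8, 0.0, 0.6],
    "m4": [0.0, 0.6, 0.8],
}
ratings = {"m1": 5.0, "m2": 1.0}  # the items in I_u with their ratings

def user_profile(ratings, item_vectors):
    """x_u = sum over rated items of r_ui * x_i."""
    dim = len(next(iter(item_vectors.values())))
    profile = [0.0] * dim
    for item, r in ratings.items():
        for k, w in enumerate(item_vectors[item]):
            profile[k] += r * w
    return profile

def cosine(a, b):
    """Cosine similarity, used here as the score between profile and item."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

x_u = user_profile(ratings, item_vectors)
unseen = [i for i in item_vectors if i not in ratings]
ranked = sorted(unseen, key=lambda i: cosine(x_u, item_vectors[i]), reverse=True)
print(x_u)     # [5.0, 1.0, 0.0]
print(ranked)  # ['m3', 'm4']
```

The user rated `m1` highly, so the profile leans toward the first dimension and `m3`, which shares that dimension, is ranked first.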

Methods for learning user profiles

Probabilistic Methods and Naive Bayes

Relevance Feedback and Rocchio’s Algorithm

Collaborative methods

Algorithms for collaborative recommendation can be grouped into two general classes: memory-based (or heuristic-based) and model-based.

Memory-based (neighborhood) approaches

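User-based rating prediction

The unknown rating $r_{ui}$ of user $u$ for item $i$ is usually computed as an aggregate of the ratings that other users gave to the same item $i$:

$$r_{ui} = \operatorname{aggr}_{v \in N_i(u)} r_{vi}$$

where $N_i(u)$ denotes the set of the $K$ users most similar to user $u$ who have rated item $i$. Some examples of the aggregation function are:

$$\text{(a)}\quad r_{ui} = \frac{1}{|N_i(u)|} \sum_{v \in N_i(u)} r_{vi}$$

$$\text{(b)}\quad r_{ui} = \frac{\sum_{v \in N_i(u)} w_{uv} \times r_{vi}}{\sum_{v \in N_i(u)} w_{uv}}$$

$$\text{(c)}\quad r_{ui} = \bar{r}_u + \frac{\sum_{v \in N_i(u)} w_{uv} \times (r_{vi} - \bar{r}_v)}{\sum_{v \in N_i(u)} w_{uv}}$$

Here $w_{uv}$ is the similarity weight between users $u$ and $v$, and $\bar{r}_u$ is the mean rating of user $u$. Eq. (c) accounts for the fact that different users use the rating scale differently (some tend to give high ratings, others low ones). Z-score normalization goes a step further and also takes the variance of each user's ratings into account.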
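The three aggregation functions (a), (b) and (c) can be compared on a toy example. The neighbor similarities, ratings and means below are invented, and the neighbor set $N_i(u)$ is assumed to have been selected already.

```python
# Hypothetical neighbors of user u who rated item i, as tuples of
# (similarity w_uv, rating r_vi, mean rating of v). All values are invented.
neighbors = [
    (0.9, 5.0, 4.0),
    (0.5, 3.0, 3.5),
    (0.1, 1.0, 2.0),
]
mean_u = 3.0  # mean rating of the active user u (also invented)

def predict_mean(neighbors):
    """(a) plain average of the neighbors' ratings."""
    return sum(r for _, r, _ in neighbors) / len(neighbors)

def predict_weighted(neighbors):
    """(b) similarity-weighted average of the neighbors' ratings."""
    total_w = sum(w for w, _, _ in neighbors)
    return sum(w * r for w, r, _ in neighbors) / total_w

def predict_mean_centered(neighbors, mean_u):
    """(c) u's mean plus the weighted average of the neighbors' deviations from their own means."""
    total_w = sum(w for w, _, _ in neighbors)
    return mean_u + sum(w * (r - m) for w, r, m in neighbors) / total_w

print(round(predict_mean(neighbors), 3))                   # 3.0
print(round(predict_weighted(neighbors), 3))               # 4.067
print(round(predict_mean_centered(neighbors, mean_u), 3))  # 3.367
```

The weighted average (b) is pulled toward the most similar neighbor's rating of 5, while (c) tempers that because this neighbor rates high on average.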

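User-based classification

$$\hat{r}_{ui} = \arg\max_{r \in S} \sum_{v \in N_i(u)} \delta(h(r_{vi}) = r)\, w_{uv}$$

Here $\delta(\cdot)$ equals 1 when its argument is true and 0 otherwise, so each neighbor votes for its own (normalized) rating $h(r_{vi})$ with weight $w_{uv}$. If ratings take values on a continuous scale, regression is more appropriate. If, on the other hand, ratings take only a few discrete values, classification may be preferable.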
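The classification rule amounts to a similarity-weighted vote among the neighbors. Here is a minimal sketch, assuming that the normalization $h$ is the identity and using invented neighbor data:

```python
from collections import defaultdict

# (similarity w_uv, rating r_vi) for the neighbors in N_i(u); values are invented.
neighbors = [(0.9, 4), (0.4, 5), (0.3, 5), (0.2, 2)]

def predict_by_vote(neighbors):
    """Return the rating value with the largest total similarity weight."""
    votes = defaultdict(float)
    for w, r in neighbors:
        votes[r] += w  # each neighbor votes for its own rating with weight w_uv
    return max(votes, key=votes.get)

print(predict_by_vote(neighbors))  # 4
```

Rating 4 wins with weight 0.9 against 0.7 for rating 5, even though two neighbors chose 5.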

Item-based rating prediction

$$r_{ui} = \frac{\sum_{j \in N_u(i)} w_{ij} \times r_{uj}}{\sum_{j \in N_u(i)} w_{ij}}$$
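A sketch of the item-based prediction, including the selection of $N_u(i)$ as the $k$ items rated by the user that are most similar to the target item. The similarity values, the ratings, and the choice `k=2` are invented for the example.

```python
# Hypothetical similarities w_ij between the target item i and other items,
# and the ratings r_uj of user u. All values are invented.
sim_to_target = {"j1": 0.8, "j2": 0.6, "j3": 0.1, "j4": 0.7}
user_ratings = {"j1": 4.0, "j2": 2.0, "j3": 5.0}  # u has not rated j4

def predict_item_based(sim_to_target, user_ratings, k=2):
    """Weighted average of u's ratings on the k rated items most similar to i."""
    rated = [j for j in user_ratings if j in sim_to_target]
    # N_u(i): the k items rated by u that are most similar to the target item
    nearest = sorted(rated, key=lambda j: sim_to_target[j], reverse=True)[:k]
    total_w = sum(sim_to_target[j] for j in nearest)
    if total_w == 0:
        return None  # no usable neighbors
    return sum(sim_to_target[j] * user_ratings[j] for j in nearest) / total_w

print(round(predict_item_based(sim_to_target, user_ratings), 3))  # 3.143
```

Item `j4` is similar to the target but is skipped because the user has not rated it.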

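User-based vs. item-based

  • When the number of users is much smaller than the number of items, user-based methods are preferable.
  • When the number of users is much larger than the number of items, item-based methods are preferable.
  • User-based methods are more likely to make serendipitous recommendations.
  • Item-based recommendations are easier to justify than user-based ones.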

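Similarity computation

Cosine vector (CV) similarity:

$$CV(u, v) = \cos(x_u, x_v) = \frac{\sum_{i \in I_{uv}} r_{ui} r_{vi}}{\sqrt{\sum_{i \in I_u} r_{ui}^2 \sum_{i \in I_v} r_{vi}^2}}$$

Pearson Correlation (PC) similarity:

$$PC(u, v) = \frac{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)^2 \sum_{i \in I_{uv}} (r_{vi} - \bar{r}_v)^2}}$$

For item similarity:

$$PC(i, j) = \frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}{\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)^2 \sum_{u \in U_{ij}} (r_{uj} - \bar{r}_j)^2}}$$

Adjusted Cosine (AC) similarity:

$$AC(i, j) = \frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_u)(r_{uj} - \bar{r}_u)}{\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_u)^2 \sum_{u \in U_{ij}} (r_{uj} - \bar{r}_u)^2}}$$

Other similarity measures include Mean Squared Difference (MSD) and Spearman Rank Correlation (SRC), which considers the ranking of the ratings rather than their values.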
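The user-user cosine and Pearson similarities can be sketched as below. The ratings are invented, and one detail is an assumption on my part: the mean $\bar{r}_u$ is taken over all of a user's ratings, while the sums run over the common items $I_{uv}$ only.

```python
import math

# Toy ratings r_ui as {user: {item: rating}}; the values are invented.
ratings = {
    "u": {"a": 5, "b": 3, "c": 4},
    "v": {"a": 4, "b": 2, "c": 3, "d": 5},
}

def cosine_sim(ru, rv):
    """CV(u, v): dot product over common items, norms over each user's own items."""
    common = set(ru) & set(rv)
    num = sum(ru[i] * rv[i] for i in common)
    den = math.sqrt(sum(x * x for x in ru.values()) * sum(x * x for x in rv.values()))
    return num / den if den else 0.0

def pearson_sim(ru, rv):
    """PC(u, v) over the common items, centering each rating on its user's mean."""
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    mean_u = sum(ru.values()) / len(ru)  # assumption: mean over all of u's ratings
    mean_v = sum(rv.values()) / len(rv)
    num = sum((ru[i] - mean_u) * (rv[i] - mean_v) for i in common)
    den = math.sqrt(
        sum((ru[i] - mean_u) ** 2 for i in common)
        * sum((rv[i] - mean_v) ** 2 for i in common)
    )
    return num / den if den else 0.0

print(round(cosine_sim(ratings["u"], ratings["v"]), 3))   # 0.731
print(round(pearson_sim(ratings["u"], ratings["v"]), 3))  # 0.853
```

The adjusted cosine between two items follows the same pattern, with the roles of users and items swapped and each rating centered on its user's mean.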

Neighborhood selection

In large recommender systems that can have millions of users and items, it is usually not possible to store the (non-zero) similarities between each pair of users or items, due to memory limitations. Moreover, doing so would be extremely wasteful, as only the most significant of these values are used in the predictions. Pre-filtering the neighbors is an essential step that makes neighborhood-based approaches practicable: it reduces the number of similarity weights to store and limits the number of candidate neighbors to consider in the predictions. Several ways of doing this are listed below.

  • top-N filtering: for each user or item, keep only the N nearest neighbors
  • threshold filtering: keep only the neighbors whose similarity weight has a magnitude above a given threshold
  • negative filtering: discard negative similarity weights

After pre-filtering, the k nearest neighbors are chosen from the filtered candidates. The choice of k is important and can be made by cross-validation.
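The three pre-filtering strategies can be sketched as simple dictionary filters. The weights and function names are invented, and the exact criteria (keep the n largest weights, keep magnitudes above a threshold, drop negative weights) reflect my reading of the three strategies rather than a reference implementation.

```python
# Hypothetical similarity weights between one user and the other users.
weights = {"v1": 0.92, "v2": 0.15, "v3": -0.40, "v4": 0.61, "v5": 0.05}

def top_n_filter(weights, n):
    """Keep only the n largest similarity weights."""
    kept = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return dict(kept)

def threshold_filter(weights, w_min):
    """Keep the weights whose magnitude exceeds w_min."""
    return {v: w for v, w in weights.items() if abs(w) > w_min}

def negative_filter(weights):
    """Discard negative similarity weights."""
    return {v: w for v, w in weights.items() if w >= 0}

print(top_n_filter(weights, 2))        # {'v1': 0.92, 'v4': 0.61}
print(threshold_filter(weights, 0.3))  # {'v1': 0.92, 'v3': -0.4, 'v4': 0.61}
print(sorted(negative_filter(weights)))  # ['v1', 'v2', 'v4', 'v5']
```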

Advantages and disadvantages

Advantages compared with model-based methods:

  • simplicity
  • justifiability
  • efficiency: the nearest neighbors can be pre-computed in an offline step and provide near instantaneous recommendations
  • stability
  • bring some ‘serendipity’ (novelty)

Model-based approaches

Neighborhood methods have two important flaws:

  • Limited coverage: users can be neighbors only if they have rated common items. This assumption is very limiting, as users who have rated few or no common items may still have similar preferences.
  • Sensitivity to sparse data: the accuracy of neighborhood-based methods suffers from the lack of available ratings. Sparsity is a problem common to most recommender systems because users typically rate only a small proportion of the available items. (CB can sometimes be used to fill in some ratings, but only to a limited extent.)

Model-based methods use the collection of ratings to learn a model, which is then used to make rating predictions. For example:

$$r_{ui} = E(r_{ui}) = \sum_{k=0}^{n} k \times \Pr(r_{ui} = k \mid r_{uj}, j \in I_u)$$

where the rating values are integers between 0 and $n$ and $\Pr$ denotes probability. Cluster models, Bayesian networks, Gibbs sampling, Markov decision processes, probabilistic latent semantic analysis, Latent Dirichlet Allocation, etc. can be used with this approach.
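Once a model supplies the conditional distribution over rating values, the prediction is simply its expectation. A tiny sketch, with an invented distribution over the ratings 0 to 5 standing in for the output of a trained model:

```python
# Hypothetical conditional distribution Pr(r_ui = k | ratings of u) for k = 0..5.
# In practice these probabilities would come from a trained model.
probs = [0.05, 0.05, 0.10, 0.20, 0.40, 0.20]

def expected_rating(probs):
    """E(r_ui) = sum over k of k * Pr(r_ui = k)."""
    return sum(k * p for k, p in enumerate(probs))

print(round(expected_rating(probs), 2))  # 3.45
```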

The limitations of model-based methods are:

  • New user problem
  • New item problem
  • Sparsity

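Code experiments

Predicting whether a user will watch a movie

In recommender system practice, one view holds that an item should be recommended to a user not on the basis of how high the user is predicted to rate it, but on the basis of how likely the user is to rate or interact with it at all. Below we run experiments on the small MovieLens dataset, with scores computed in the following three ways:
1. If user $u$ has rated item $i$, then $r_{ui} = 1$; otherwise $r_{ui} = 0$. Here $r_{ui}$ is not a rating; it represents the propensity of user $u$ to click item $i$, and a larger value means a stronger propensity.
- Parameter: number of neighbors $K = 7$
- Aggregation function: $\sum_{v \in N_i(u)} w_{uv} \times [r_{vi} = 1]$, with $|N_i(u)|$ fixed at 7
- Recall, precision: (0.11920293092663747, 0.20491551459293394)
2. If user $u$ has rated item $i$, then $r_{ui}$ is nonzero and reflects the rating given (see the aggregation function below); otherwise $r_{ui} = 0$. As before, the aggregated score is read as the propensity of user $u$ to click item $i$, not as a predicted rating, and a larger value means a stronger propensity.
- Parameter: number of neighbors $K = 7$
- Aggregation function: $\sum_{v \in N_i(u)} w_{uv} \times r_{vi}$, with $|N_i(u)|$ fixed at 7. (This takes into account not only whether a rating exists, but also the size of the rating and the similarity between the two users.)
- Recall, precision: (0.11973907604324904, 0.20583717357910905)
3. Recommending purely by the highest predicted rating performs considerably worse.
- Aggregation function: $r_{ui} = \frac{\sum_{v \in N_i(u)} w_{uv} \times r_{vi}}{\sum_{v \in N_i(u)} w_{uv}}$
- Recall, precision: (0.015190778303994281, 0.026113671274961597)

Conclusion: a high predicted rating does not mean the user will actually interact with the movie; predicting whether the user will watch a movie is the more reasonable target (at least for this MovieLens data).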
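To make the first setup and the two metrics concrete, here is a minimal sketch of binary user-CF scoring with recall and precision over top-N lists. It is not the code behind the numbers reported here: the toy data, the precomputed similarities `sim`, the choice of the nearest users overall (rather than per item), and the metric definitions (hits divided by test items, and hits divided by recommended items) are all assumptions made for illustration.

```python
# Hypothetical data: the items each user interacted with in training and in test.
train = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "d"},
}
test = {"u1": {"c"}, "u2": {"d"}, "u3": {"c"}}
# Hypothetical precomputed user-user similarities w_uv.
sim = {
    "u1": {"u2": 0.8, "u3": 0.4},
    "u2": {"u1": 0.8, "u3": 0.3},
    "u3": {"u1": 0.4, "u2": 0.3},
}

def recommend(user, k=2, n=1):
    """Score unseen items by the summed similarity of the k nearest users who have them."""
    nearest = sorted(sim[user], key=sim[user].get, reverse=True)[:k]
    scores = {}
    for v in nearest:
        for item in train[v] - train[user]:
            scores[item] = scores.get(item, 0.0) + sim[user][v]  # binary r_vi = 1
    return sorted(scores, key=scores.get, reverse=True)[:n]

def recall_precision(recs, test):
    """Recall = hits / test items, precision = hits / recommended items."""
    hits = sum(len(set(recs[u]) & test[u]) for u in test)
    n_test = sum(len(test[u]) for u in test)
    n_rec = sum(len(recs[u]) for u in test)
    return hits / n_test, hits / n_rec

recs = {u: recommend(u) for u in train}
print(recs)  # {'u1': ['c'], 'u2': ['d'], 'u3': ['a']}
print(recall_precision(recs, test))
```

Two of the three single-item recommendations hit the test set, so both recall and precision come out as 2/3 on this toy data.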

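Predicting a user's rating of a movie

The experiments above predict whether a user will watch a movie. Another family of algorithms predicts the rating (degree of preference) a user will give a movie. Collaborative filtering can address this problem, in two main flavors:

  • Neighborhood-based collaborative filtering: user-CF, item-CF
  • Model-based collaborative filtering: SVD, etc.

This post tested user-CF and SVD on the MovieLens data. The RMSE of the predicted ratings was 1.007 for user-CF and 0.97 for SVD, so SVD predicted better on this dataset.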
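For reference, the RMSE used to compare the two algorithms can be computed as in the sketch below. The predicted and actual ratings are invented and are unrelated to the figures reported for MovieLens.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error over paired predicted and true ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

# Invented predictions and ground-truth ratings for four (user, movie) pairs.
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4.0, 4.0, 1.0, 3.0]
print(round(rmse(predicted, actual), 3))  # 1.146
```

A lower RMSE means the predicted ratings are closer to the true ones on average, with large errors penalized more heavily than small ones.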

Data and code download link: http://download.csdn.net/detail/u011467621/9133001

Later code updates are at
https://github.com/joeyqzhou/recommendation-system

References


  • Ricci, Francesco, Lior Rokach, and Bracha Shapira. Introduction to recommender systems handbook. Springer US, 2011.
  • 项亮. 推荐系统实践 (Recommender System Practice). 人民邮电出版社, 2012.

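Note: a translation of the untranslated passages will be added when time permits.
My knowledge is limited; discussion by email ([email protected]) or in the comments is welcome.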
