A Survey of Recommender Systems, with Code

By Joey周琦

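Introduction and Notation

Broadly speaking, a recommender system can be framed as predicting the rating or click-through rate of a user for an item. The problem is described as follows.

User-item interactions fall into three main types:

Scalar: numerical ratings or ordinal values.
Binary: like/dislike, 0 or 1, click or no click, and so on.
Unary: purchase, online access, etc.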

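For clarity, we fix the following notation.

item: a product; user: a customer (the terms "item" and "product" are used interchangeably below)
$U$ : the set of users
$I$ : the set of items
$R$ : the set of ratings (or click-through rates, etc.) given by users to items
$S$ : the set of possible rating values (e.g., $S=\{1,\ldots,5\}$ or $S=\{\text{like},\text{dislike}\}$ )
$r_{ui}$ : the rating given by user $u$ to item $i$
$U_i$ : the set of users who have rated item $i$
$I_u$ : the set of items rated by user $u$
$I_{uv}=I_u \cap I_v$ , $U_{ij}=U_i \cap U_j$
$N(u)$ : the $K$ nearest neighbors (KNN) of user $u$
$N_i(u)$ : the $K$ nearest neighbors of user $u$ among the users who have rated item $i$
$N_u(i)$ : the $K$ nearest neighbors of item $i$ among the items rated by user $u$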

If a utility function $f$ measures the interest of user $u$ in item $i$, i.e., $f: U\times I \to S$, then for each user $u \in U$ we want to choose the item $i \in I$ that maximizes that interest:

$\begin{equation} \forall u \in U, \quad i_u = \arg \max_{i\in I} f(u,i). \end{equation}$
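As a minimal sketch of this selection rule, the snippet below picks, for each user, the item with the highest utility. The helper name `best_item_per_user` and the utility table `toy_utility` are illustrative assumptions, not part of the original post.

```python
def best_item_per_user(users, items, f):
    """For every user u, return the item i that maximizes f(u, i)."""
    return {u: max(items, key=lambda i: f(u, i)) for u in users}


# Hypothetical utility values, made up for illustration only.
toy_utility = {
    ("alice", "movie_a"): 4.0, ("alice", "movie_b"): 2.5,
    ("bob", "movie_a"): 1.0, ("bob", "movie_b"): 3.5,
}

best = best_item_per_user(
    users=["alice", "bob"],
    items=["movie_a", "movie_b"],
    f=lambda u, i: toy_utility[(u, i)],
)
print(best)
```

In practice $f$ is unknown for most user-item pairs, and the rest of this post is about ways to estimate it.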

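Each user in the user space $U$ can be represented by a profile, which may include basic attributes (gender, age, region, education) and historical behavior (keyed by user ID).
Each item in the item space $I$ can be represented by its content (title, category, etc.) and its historical behavior (keyed by item ID).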

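Explicit feedback from users takes the following three forms. This kind of information is scarce but relatively accurate.
* like/dislike
* ratings
* text comments

Implicit feedback includes actions such as saving, clicking, discarding, printing, and tagging. These actions do not require the user's direct involvement. This kind of information is plentiful but less accurate than explicit feedback.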

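Recommender systems are generally divided into two classes, plus hybrids of the two:

Content-based recommendations (CB): recommend items whose content is similar to that of the items in the user's browsing history.
Collaborative filtering recommendations (CF): recommend items viewed by similar users (user-CF), or items similar to those the user has viewed (item-CF). Similarity here is not derived from content analysis: two users are more similar the more items they have both viewed, and two items are more similar the more users have viewed both.
Hybrid approaches that combine the two.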

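Content-based recommender systems

A content-based (CB) recommender analyzes a user's past ratings, clicks, and other behavior to build a profile or model for each user. The profile is a structured representation of the user's interests and is used to generate new recommendations. Content-based filtering needs techniques (natural language processing, information extraction such as tf-idf, etc.) to represent each item and each user profile, and a strategy for comparing a user profile with an item.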

The content-based utility function $f(u,i)$ can be defined as:

$\begin{equation} f (u,i) = score(ContentBasedProfile(u),Content(i)) \end{equation}$

$Content(i)$ is the profile of item $i$, and $ContentBasedProfile(u)$ is the profile of user $u$.
Both $Content(i)$ and $ContentBasedProfile(u)$ can be represented as tf-idf vectors (or by other techniques).

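Many machine learning methods, such as naive Bayes, neural networks, and decision trees, can be applied to content-based recommendation. (My tentative understanding is that the vector representations serve as features, and clicks or ratings serve as labels, for training a classifier.)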

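The content-based recommendation framework is shown in the figure below.

[Figure: content-based recommendation framework]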

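CB has the following advantages:

User independence: a content-based recommender builds each user's profile solely from the ratings and behavior of that user.
Transparency: recommendations are easy to explain.
New items: CB can recommend items that have no interaction history, because it relies on content analysis.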

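CB has some limitations:

Limited content analysis: techniques for analyzing audio and image content are limited.
Over-specialization: the system only recommends items related to the content of the user's history, so recommendations lack novelty.
New users: recommending to new users is difficult because they have no history.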

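Item representation

In most CB systems, item descriptions are textual features extracted from web pages, articles, reviews, and similar content. Because natural language is ambiguous, textual features raise some difficult problems:

Polysemy: one word has multiple meanings.
Synonymy: multiple words have the same meaning.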

Keyword-based vector space model

Most content-based recommender systems use relatively simple retrieval models, such as keyword matching or the Vector Space Model (VSM) with basic TF-IDF weighting. Let $D =\{d_1,d_2,\cdots,d_N\}$ denote a set of documents (the corpus), and let $T=\{t_1,t_2,\cdots, t_n\}$ be the dictionary, that is, the set of words in the corpus. $T$ is obtained by applying standard natural language processing operations such as tokenization, stopword removal, and stemming. Each document $d_j$ is represented as a vector in an $n$-dimensional vector space, $d_j=(w_{1j},w_{2j},\cdots,w_{nj})$, where $w_{kj}$ is the weight of term $t_k$ in document $d_j$. TF-IDF weighting is based on the following assumptions:

Rare terms are not less relevant than frequent terms (IDF assumption).
Multiple occurrences of a term in a document are not less relevant than single occurrences (TF assumption).
Long documents are not preferred to short documents (normalization assumption).

TF-IDF is calculated as follows:

$\begin{equation} \text{TF-IDF}(t_k,d_j) = TF(t_k,d_j) \cdot \log \frac{N}{n_k} \end{equation}$

The second factor, $\log \frac{N}{n_k}$, is the IDF, where $n_k$ is the number of documents in which term $t_k$ occurs at least once. The first factor, TF, is computed as

$\begin{equation} TF(t_k,d_j) = \frac {f_{k,j}} {\max_z f_{z,j}} \end{equation}$

where $f_{k,j}$ is the number of times term $t_k$ occurs in document $d_j$, and the maximum is taken over all terms $t_z$ that occur in $d_j$. Cosine normalization then gives the weight:

$\begin{equation} w_{k,j} = \frac{\text{TF-IDF}(t_k,d_j)} {\sqrt{ \sum \nolimits_{s=1}^{n} \text{TF-IDF}(t_s,d_j)^2}} \end{equation}$

Cosine similarity is the most widely used measure of how close two documents are:

$\begin{equation} sim(d_i,d_j) = \frac{\sum_k w_{ki}w_{kj}} {\sqrt{\sum_k w_{ki}^2}\sqrt{\sum_k w_{kj}^2}} \end{equation}$
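The weighting and similarity formulas above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the documents are already tokenized and non-empty, the toy corpus is invented, and the function names `tfidf_vectors` and `cosine` are hypothetical rather than taken from the post's code.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized, non-empty document.

    Weights are max-normalized TF times log(N / n_k), followed by
    cosine (unit-length) normalization.
    """
    n_docs = len(docs)
    # n_k: number of documents in which term t_k occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values())
        raw = {
            t: (c / max_count) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


# Toy corpus, already tokenized (hypothetical data).
docs = [
    ["space", "alien", "ship"],
    ["space", "ship", "battle"],
    ["romance", "wedding", "comedy"],
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # two documents sharing "space" and "ship"
sim_02 = cosine(vecs[0], vecs[2])  # no shared terms
print(sim_01, sim_02)
```

Documents that share terms get a positive similarity, while documents with disjoint vocabularies get zero, which is also why synonymy is a problem for pure keyword models.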

General content-based approach

Suppose each item $i$ is represented by some vector $\mathbf{x}_i$. The profile vector $\mathbf{x}_u$ of user $u$ can then be built as the rating-weighted sum of the vectors of the items the user has rated:

$\begin{equation} \mathbf{x}_u = \sum \limits_{i \in I_u} r_{ui} \mathbf{x}_i \end{equation}$
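A small sketch of this profile construction follows. The item vectors, the ratings, and the choice of cosine similarity as the $score$ function are all assumptions made for illustration; the post itself does not fix a particular scoring function.

```python
import math


def build_user_profile(ratings, item_vectors):
    """x_u = sum over rated items of r_ui * x_i (dense lists of equal length)."""
    dim = len(next(iter(item_vectors.values())))
    profile = [0.0] * dim
    for item, r in ratings.items():
        for k, x in enumerate(item_vectors[item]):
            profile[k] += r * x
    return profile


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


# Hypothetical 3-dimensional item vectors (for example, genre weights).
item_vectors = {
    "sci_fi_1": [1.0, 0.0, 0.0],
    "sci_fi_2": [0.9, 0.1, 0.0],
    "romance_1": [0.0, 0.0, 1.0],
    "romance_2": [0.0, 0.2, 0.8],
}
ratings_u = {"sci_fi_1": 5, "romance_1": 1}  # r_ui for the items in I_u

x_u = build_user_profile(ratings_u, item_vectors)
unseen = [i for i in item_vectors if i not in ratings_u]
scores = {i: cosine(x_u, item_vectors[i]) for i in unseen}
ranked = sorted(scores, key=scores.get, reverse=True)
print(x_u, ranked)
```

Because the user rated the science-fiction item much higher, the profile leans toward that direction and the unseen science-fiction item is ranked first.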

Methods for learning user profiles

Probabilistic Methods and Naive Bayes

Relevance Feedback and Rocchio’s Algorithm

Collaborative methods

Algorithms for collaborative recommendation can be grouped into two general classes: memory-based (or heuristic-based) and model-based.

Memory-based (neighborhood) approaches

User-based rating

The unknown rating $r_{u,i}$ of user $u$ for item $i$ is usually computed as an aggregate of the ratings that some other users gave to the same item $i$ :

$r_{u,i} =\text{aggr} _ {v \in N_i(u)} \quad r_{v,i}$

where $N_i(u)$ denotes the set of the $K$ users most similar to user $u$ who have rated item $i$ . Some examples of the aggregation function are:

$\begin{eqnarray} (a)\quad r_{u,i} = \frac{1}{|N_i(u)|} \sum\limits_{v \in N_i(u)} r_{v,i} \\ (b)\quad r_{u,i} = \frac {\sum\limits_{v \in N_i(u)} w_{uv}\times r_{v,i}} {\sum \limits_{v \in N_i(u)} w_{uv}} \\ (c)\quad r_{u,i} = \bar r_u + \frac {\sum\limits_{v \in N_i(u)} w_{uv} \times (r_{v,i}-\bar r_{v})} {\sum \limits_{v \in N_i(u)} w_{uv}} \end{eqnarray}$

Here $w_{uv}$ is the similarity weight between users $u$ and $v$, and $\bar r_u$ is the mean rating of user $u$.

Equation (c) accounts for the fact that different users use different rating scales: some tend to give high ratings and others tend to give low ratings. Z-score normalization goes one step further and also takes the variance of each user's ratings into account.
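The three aggregation functions (a), (b) and (c) can be compared side by side on a toy example. The ratings, the similarity weights `w`, and the function names are invented for illustration; the sketch also assumes positive weights, so the denominators are non-zero.

```python
def mean_rating(ratings, v):
    """Mean of all ratings given by user v."""
    return sum(ratings[v].values()) / len(ratings[v])


def predict_mean(neighbors, ratings, item):
    """(a) plain average of the neighbors' ratings for the item."""
    return sum(ratings[v][item] for v in neighbors) / len(neighbors)


def predict_weighted(neighbors, ratings, item, w):
    """(b) similarity-weighted average (assumes positive weights)."""
    num = sum(w[v] * ratings[v][item] for v in neighbors)
    den = sum(w[v] for v in neighbors)
    return num / den


def predict_mean_centered(u, neighbors, ratings, item, w):
    """(c) user mean plus weighted average of mean-centered neighbor ratings."""
    num = sum(w[v] * (ratings[v][item] - mean_rating(ratings, v)) for v in neighbors)
    den = sum(w[v] for v in neighbors)
    return mean_rating(ratings, u) + num / den


# Hypothetical data: user "u" has not rated item "c"; v1 and v2 have.
ratings = {
    "u": {"a": 4, "b": 2},
    "v1": {"a": 5, "b": 3, "c": 4},
    "v2": {"a": 2, "b": 1, "c": 3},
}
w = {"v1": 0.8, "v2": 0.2}  # similarity of each neighbor to u (made up)
neighbors = ["v1", "v2"]    # N_c(u)

pred_a = predict_mean(neighbors, ratings, "c")
pred_b = predict_weighted(neighbors, ratings, "c", w)
pred_c = predict_mean_centered("u", neighbors, ratings, "c", w)
print(pred_a, pred_b, pred_c)
```

In this example v2 rates everything low, so its rating of 3 is actually above its own average; only variant (c) picks that up.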

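User-based classification

$\begin{equation} \hat r_{ui} = \arg \max_{r \in S} \sum \limits_{v \in N_i(u)} \delta (h(r_{vi})=r)\, w_{uv} \end{equation}$

Here $\delta(\cdot)$ equals 1 when its argument is true and 0 otherwise, and $h$ is a function that normalizes a neighbor's rating onto $S$. The predicted rating is the value that collects the largest similarity-weighted vote.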

If ratings take values on a continuous or wide scale, regression is more appropriate. If, on the other hand, ratings take only a few discrete values, classification may be preferable.
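As a rough sketch of the classification variant, each neighbor casts a vote for its (possibly normalized) rating, weighted by its similarity to the target user. The data, the function name `predict_by_vote`, and the like/dislike mapping used for `h` are illustrative assumptions.

```python
from collections import defaultdict


def predict_by_vote(neighbors, ratings, item, w, h=lambda r: r):
    """Return the value in S with the largest similarity-weighted vote.

    h maps a raw rating onto the label set S (identity by default).
    """
    votes = defaultdict(float)
    for v in neighbors:
        votes[h(ratings[v][item])] += w[v]
    return max(votes, key=votes.get)


# Hypothetical neighbors of the target user who rated item "c".
ratings = {"v1": {"c": 5}, "v2": {"c": 4}, "v3": {"c": 4}}
w = {"v1": 0.5, "v2": 0.3, "v3": 0.3}
neighbors = ["v1", "v2", "v3"]

predicted_rating = predict_by_vote(neighbors, ratings, "c", w)
predicted_label = predict_by_vote(
    neighbors, ratings, "c", w,
    h=lambda r: "like" if r >= 4 else "dislike",
)
print(predicted_rating, predicted_label)
```

The two less similar neighbors together outweigh the single most similar one, so the vote lands on 4 rather than 5.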
Item-based rating

$\begin{equation} r_{u,i} = \frac {\sum\limits_{j \in N_u(i)} w_{ij}\times r_{u,j}} {\sum \limits_{j \in N_u(i)} w_{ij}} \end{equation}$
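The item-based formula can be sketched the same way: the prediction is a weighted average of the user's own ratings on the items most similar to the target item. The item similarities below are invented, and keeping only positively similar items is a simplifying assumption of this sketch.

```python
def predict_item_based(user_ratings, item, item_sims, k=2):
    """Weighted average of the user's ratings on the k items most similar to `item`.

    Returns None when the user has rated no positively similar item.
    """
    candidates = [
        j for j in user_ratings
        if j != item and item_sims[item].get(j, 0.0) > 0
    ]
    neighbors = sorted(candidates, key=lambda j: item_sims[item][j], reverse=True)[:k]
    if not neighbors:
        return None
    num = sum(item_sims[item][j] * user_ratings[j] for j in neighbors)
    den = sum(item_sims[item][j] for j in neighbors)
    return num / den


# Hypothetical similarities w_ij between item "c" and other items.
item_sims = {"c": {"a": 0.9, "b": 0.1, "d": 0.5}}
user_ratings = {"a": 5, "b": 1, "d": 3}  # ratings of user u (item "c" is unrated)

pred = predict_item_based(user_ratings, "c", item_sims, k=2)
print(pred)
```

With $k=2$ the neighborhood $N_u(c)$ is {a, d}, so the weakly similar item b does not affect the prediction.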

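User-based vs. item-based

If the number of users is much smaller than the number of items, prefer user-based.
If the number of users is much larger than the number of items, prefer item-based.
User-based methods are more likely to make serendipitous recommendations.
Item-based recommendations are easier to justify than user-based ones.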

Similarity computation

Cosine vector (CV) similarity:

$\begin{equation} CV(u,v) = \cos(\mathbf{x}_u,\mathbf{x}_v) = \frac{\sum \limits_{i \in I_{uv}} r_{ui} r_{vi}} {\sqrt{\sum \limits_{i \in I_{u}} r_{ui}^2 \sum \limits_{i \in I_{v}} r_{vi}^2} } \end{equation}$

Pearson Correlation (PC) similarity:

$\begin{equation} PC(u,v) = \frac{\sum \limits_{i \in I_{uv}} (r_{ui}-\bar r_u) (r_{vi}-\bar r_v)} {\sqrt{\sum \limits_{i \in I_{uv}} (r_{ui}-\bar r_u)^2 \sum \limits_{i \in I_{uv}} (r_{vi}-\bar r_v)^2} } \end{equation}$
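To make the difference between the two user similarities concrete, here is a minimal sketch of both. It assumes each user has at least one rating and takes the user means over each user's full rating set, which is one possible reading of $\bar r_u$; the three toy users are invented.

```python
import math


def cosine_similarity(ru, rv):
    """CV(u, v): dot product over co-rated items, norms over each user's own items."""
    common = set(ru) & set(rv)
    num = sum(ru[i] * rv[i] for i in common)
    den = math.sqrt(sum(r * r for r in ru.values()) * sum(r * r for r in rv.values()))
    return num / den if den > 0 else 0.0


def pearson_similarity(ru, rv):
    """PC(u, v) over co-rated items, with means over each user's full rating set."""
    common = set(ru) & set(rv)
    mean_u = sum(ru.values()) / len(ru)
    mean_v = sum(rv.values()) / len(rv)
    num = sum((ru[i] - mean_u) * (rv[i] - mean_v) for i in common)
    den = math.sqrt(
        sum((ru[i] - mean_u) ** 2 for i in common)
        * sum((rv[i] - mean_v) ** 2 for i in common)
    )
    return num / den if den > 0 else 0.0


# Hypothetical users: v agrees with u on a compressed scale, w has opposite tastes.
r_u = {"a": 5, "b": 3, "c": 1}
r_v = {"a": 4, "b": 3, "c": 2}
r_w = {"a": 1, "b": 3, "c": 5}

pc_uv = pearson_similarity(r_u, r_v)
pc_uw = pearson_similarity(r_u, r_w)
cv_uw = cosine_similarity(r_u, r_w)
print(pc_uv, pc_uw, cv_uw)
```

For the users with opposite tastes, the Pearson correlation is negative while the raw cosine stays positive, because all ratings are positive numbers. Mean-centering is what lets the measure express disagreement.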

For item similarity:

$\begin{equation} PC(i,j) = \frac{\sum \limits_{u \in U_{ij}} (r_{ui}-\bar r_i) (r_{uj}-\bar r_j)} {\sqrt{\sum \limits_{u \in U_{ij}} (r_{ui}-\bar r_i)^2 \sum \limits_{u \in U_{ij}} (r_{uj}-\bar r_j)^2} } \end{equation}$

Adjusted cosine (AC) similarity:

$\begin{equation} AC(i,j) = \frac{\sum \limits_{u \in U_{ij}} (r_{ui}-\bar r_u) (r_{uj}-\bar r_u)} {\sqrt{\sum \limits_{u \in U_{ij}} (r_{ui}-\bar r_u)^2 \sum \limits_{u \in U_{ij}} (r_{uj}-\bar r_u)^2} } \end{equation}$
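Adjusted cosine differs from the item-based Pearson correlation only in which mean is subtracted: the mean of the user, not the mean of the item. A minimal sketch, with an invented rating table and a hypothetical function name:

```python
import math


def adjusted_cosine(ratings, i, j):
    """AC(i, j): item similarity after subtracting each user's mean rating."""
    common_users = [u for u, r in ratings.items() if i in r and j in r]
    num = den_i = den_j = 0.0
    for u in common_users:
        mean_u = sum(ratings[u].values()) / len(ratings[u])
        dev_i = ratings[u][i] - mean_u
        dev_j = ratings[u][j] - mean_u
        num += dev_i * dev_j
        den_i += dev_i * dev_i
        den_j += dev_j * dev_j
    den = math.sqrt(den_i * den_j)
    return num / den if den > 0 else 0.0


# Hypothetical ratings: items "a" and "b" are liked together, "c" is disliked.
ratings = {
    "u1": {"a": 5, "b": 4, "c": 1},
    "u2": {"a": 3, "b": 2, "c": 1},
    "u3": {"a": 4, "b": 5, "c": 2},
}

ac_ab = adjusted_cosine(ratings, "a", "b")
ac_ac = adjusted_cosine(ratings, "a", "c")
print(ac_ab, ac_ac)
```

Items that users tend to rate on the same side of their personal average come out as similar, and items rated on opposite sides come out as dissimilar.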

Other similarity measures include Mean Squared Difference (MSD) and Spearman Rank Correlation (SRC), which uses the ranks of the ratings instead of their values.

Neighborhood selection

In large recommender systems that can have millions of users and items, it is usually not possible to store the (non-zero) similarities between each pair of users or items, due to memory limitations. Moreover, doing so would be extremely wasteful, as only the most significant of these values are used in the predictions. Pre-filtering of neighbors is an essential step that makes neighborhood-based approaches practicable by reducing the number of similarity weights to store and limiting the number of candidate neighbors to consider in the predictions. Common pre-filtering strategies are:

top-N filtering
threshold filtering
Negative filtering

After pre-filtering, the $k$ nearest neighbors are chosen from the remaining candidates. The choice of $k$ matters and can be made by cross-validation.
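The three pre-filtering strategies can be combined in one small helper. This is a sketch under assumptions: the function name, its parameters, and the similarity values are hypothetical, and "negative filtering" is implemented here as dropping every non-positive similarity.

```python
def prefilter_neighbors(sims, top_n=None, min_sim=None, drop_negative=True):
    """Reduce a {neighbor: similarity} dict to a ranked list of candidates."""
    kept = dict(sims)
    if drop_negative:
        # Negative filtering: discard non-positive similarities.
        kept = {v: s for v, s in kept.items() if s > 0}
    if min_sim is not None:
        # Threshold filtering: keep similarities of at least min_sim.
        kept = {v: s for v, s in kept.items() if s >= min_sim}
    ranked = sorted(kept.items(), key=lambda vs: vs[1], reverse=True)
    if top_n is not None:
        # Top-N filtering: keep only the N most similar candidates.
        ranked = ranked[:top_n]
    return ranked


# Hypothetical similarities of other users to one target user.
sims = {"v1": 0.9, "v2": 0.4, "v3": -0.3, "v4": 0.05, "v5": 0.6}

top2 = prefilter_neighbors(sims, top_n=2)
above_threshold = prefilter_neighbors(sims, min_sim=0.3)
print(top2, above_threshold)
```

Only the filtered lists need to be stored, which is what keeps memory use manageable when the number of users or items is large.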

Advantages and disadvantages

Advantages compared with model-based methods:

simplicity
justifiability
efficiency: the nearest neighbors can be pre-computed in an offline step and provide near instantaneous recommendations
stability
bring some ‘serendipity’ (novelty)

Model based

Neighborhood methods have two important flaws:

Limited coverage: users can be neighbors only if they have rated common items. This assumption is very limiting, as users who have rated few or no common items may still have similar preferences.
Sensitivity to sparse data: the accuracy of neighborhood-based recommendation methods suffers from the lack of available ratings. Sparsity is a problem common to most recommender systems, because users typically rate only a small proportion of the available items. (CB can sometimes be used to fill in some ratings, but only to a limited extent.)

Model-based methods use the collection of ratings to learn a model, which is then used to make rating predictions. For example:

$\begin{equation} r_{u,i}=E(r_{u,i})= \sum \limits_{k=0}^{n} k \times \text{Pr}(r_{u,i}=k \mid r_{u,j},\, j \in I_u) \end{equation}$

Here rating values are integers between 0 and $n$, and Pr denotes probability. Cluster models, Bayesian networks, Gibbs sampling, Markov decision processes, probabilistic latent semantic analysis, latent Dirichlet allocation, and others can be used to estimate this probability.
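Once a model has produced a conditional distribution over rating values, the prediction itself is just an expectation. The sketch below assumes such a distribution is already available; the probabilities are invented and the function name is hypothetical.

```python
def expected_rating(prob):
    """E[r_ui] = sum over k of k * Pr(r_ui = k | observed ratings of u)."""
    total = sum(prob.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(k * p for k, p in prob.items())


# Hypothetical model output for one (user, item) pair on a 1-5 scale.
prob = {1: 0.05, 2: 0.10, 3: 0.20, 4: 0.40, 5: 0.25}
pred = expected_rating(prob)
print(pred)
```

The hard part of a model-based method is estimating `prob`; the choice among the models listed above determines how that is done.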

The limitations of model-based methods are:

New user problem
New item problem
Sparsity

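Code experiments

Predicting whether a user will watch a movie

One view in recommender-system practice is that an item is recommended to a user not because the user is predicted to rate it highly, but because the user is predicted to be likely to rate or otherwise interact with it. The experiments below use the small MovieLens dataset, with the scores computed in the following ways:
1. If user $u$ has rated item $i$, then $r_{u,i}=1$ ; otherwise $r_{u,i}=0$ . The predicted $r_{u,i}$ is not a rating; it expresses the tendency of user $u$ to click item $i$, and a larger value means a stronger tendency.
- Parameter: number of neighbors $K=7$
- Aggregation function: $\sum\limits_{v \in N_i(u)} w_{uv}\times \delta(r_{v,i} = 1)$ , with the size of $N_i(u)$ fixed at 7
- Recall, precision: (0.11920293092663747, 0.20491551459293394)
2. If user $u$ has rated item $i$, then $r_{u,i}$ is the rating; otherwise $r_{u,i}=0$ . Again, the predicted $r_{u,i}$ is not a rating; it expresses the tendency of user $u$ to click item $i$, and a larger value means a stronger tendency.
- Parameter: number of neighbors $K=7$
- Aggregation function: $\sum\limits_{v \in N_i(u)} w_{uv}\times r_{v,i}$ , with the size of $N_i(u)$ fixed at 7. (This takes into account not only whether a rating exists but also its value, as well as the similarity between the two users.)
- Recall, precision: (0.11973907604324904, 0.20583717357910905)
3. Recommending only the items with the highest predicted rating performs poorly.
- Aggregation function: $r_{u,i} = \frac {\sum\limits_{v \in N_i(u)} w_{uv}\times r_{v,i}} {\sum \limits_{v \in N_i(u)} w_{uv}}$
- Recall, precision: (0.015190778303994281, 0.026113671274961597)

Conclusion: a high predicted rating does not mean the user will actually interact with the movie; predicting whether the user will watch a movie is the more reasonable target (at least for this MovieLens dataset).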
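The post does not show how the scores and metrics were computed, so the following is only a sketch of one common setup, not the author's code. It scores unseen items as in setting 1 (binary interactions), with the simplification that the $K$ most similar users are taken overall and not per item, and it computes precision and recall pooled over all users. All names and data are hypothetical.

```python
def score_items(u, sims, seen, k=7):
    """Score unseen items for u by summing similarities of the k nearest users who interacted with them."""
    neighbors = sorted(sims[u].items(), key=lambda vs: vs[1], reverse=True)[:k]
    scores = {}
    for v, w_uv in neighbors:
        for item in seen[v]:
            if item not in seen[u]:
                scores[item] = scores.get(item, 0.0) + w_uv
    return scores


def precision_recall(recommended, held_out):
    """Precision and recall pooled over the users that received recommendations.

    recommended: {user: list of recommended items}
    held_out:    {user: set of items the user interacted with in the test set}
    """
    hits = n_rec = n_rel = 0
    for u, recs in recommended.items():
        relevant = held_out.get(u, set())
        hits += len(set(recs) & relevant)
        n_rec += len(recs)
        n_rel += len(relevant)
    precision = hits / n_rec if n_rec else 0.0
    recall = hits / n_rel if n_rel else 0.0
    return precision, recall


# Hypothetical training interactions and user similarities.
seen = {"u": {"a"}, "v1": {"a", "b"}, "v2": {"b", "c"}, "v3": {"d"}}
sims = {"u": {"v1": 0.8, "v2": 0.5, "v3": 0.1}}

scores = score_items("u", sims, seen, k=2)
top_items = sorted(scores, key=scores.get, reverse=True)[:2]
precision, recall = precision_recall({"u": top_items}, {"u": {"b", "z"}})
print(top_items, precision, recall)
```

Setting 2 would replace the constant contribution of each neighbor by `w_uv` times the neighbor's rating; the evaluation code stays the same.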

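Predicting a user's rating for a movie

The experiments above predict whether a user will watch a movie. Another family of algorithms predicts the rating (degree of preference) a user would give a movie. Collaborative filtering can address this problem, in two main ways:

Neighborhood-based collaborative filtering: user-CF, item-CF
Model-based collaborative filtering: SVD, etc.

This post tested user-CF and SVD on the MovieLens data. The RMSE of the predicted ratings is 1.007 for user-CF and 0.97 for SVD, so SVD predicts better on this dataset.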
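For reference, RMSE over a held-out set of ratings can be computed as in the sketch below. The data are invented and the function name is hypothetical; this is not the evaluation code behind the numbers reported above.

```python
import math


def rmse(predictions, truth):
    """Root mean squared error over the (user, item) pairs present in both dicts."""
    keys = predictions.keys() & truth.keys()
    if not keys:
        raise ValueError("no overlapping (user, item) pairs")
    squared_error = sum((predictions[k] - truth[k]) ** 2 for k in keys)
    return math.sqrt(squared_error / len(keys))


# Hypothetical held-out ratings and model predictions.
truth = {("u1", "a"): 4, ("u1", "b"): 2, ("u2", "a"): 5}
predictions = {("u1", "a"): 3.5, ("u1", "b"): 3.0, ("u2", "a"): 4.5}

error = rmse(predictions, truth)
print(error)
```

Because errors are squared before averaging, one large miss weighs more than several small ones, which is worth keeping in mind when comparing two models whose RMSE values are close.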

Data and code download link: http://download.csdn.net/detail/u011467621/9133001

Later updates to the code are at
https://github.com/joeyqzhou/recommendation-system

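References

Ricci, Francesco, Lior Rokach, and Bracha Shapira. Introduction to Recommender Systems Handbook. Springer US, 2011.
Xiang Liang (项亮). 推荐系统实践 (Recommender System Practice). Posts & Telecom Press (人民邮电出版社), 2012.

Note: translations of the remaining English passages will be added when time permits.
My own knowledge is limited; comments or email ([email protected]) are welcome.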