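Movie Recommendations with Collaborative Filtering in Python

Collaborative filtering searches a large group of people for the small subset whose tastes resemble ours, then combines that subset's preferences into a list of recommendations.
This article implements both user-based and item-based collaborative filtering for movie recommendation in Python 3.5. We first build a dictionary of people, items, and ratings, then use two similarity measures (Euclidean distance and Pearson correlation) to recommend movies and critics from both the user-based and the item-based perspective, and finally offer some guidance on choosing between the two approaches.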

Collecting Preferences in a Dictionary

Create a file named recommendations.py and add the following code to build the dataset:

# A dictionary of movie critics and their ratings of a small
# set of movies
critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
 'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 
 'The Night Listener': 3.0},
'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 
 'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0, 
 'You, Me and Dupree': 3.5}, 
'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
 'Superman Returns': 3.5, 'The Night Listener': 4.0},
'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
 'The Night Listener': 4.5, 'Superman Returns': 4.0, 
 'You, Me and Dupree': 2.5},
'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 
 'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
 'You, Me and Dupree': 2.0}, 
'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
 'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}}

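The dictionary above maps each critic to the scores that critic gave several movies, on a scale of 1 to 5.
This makes the data easy to query and modify. For example, to look up one person's rating of one movie: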

>>> from recommendations import critics
>>> critics['Lisa Rose']['Snakes on a Plane']
3.5
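
As a small illustration of the query and modify operations mentioned above, the sketch below works on a two-critic excerpt of the dictionary so that it runs on its own. The new rating assigned to Toby is a made-up value for demonstration, not data from the original set.

```python
# A two-critic excerpt of the critics dictionary, repeated here so the
# snippet is self-contained.
critics = {
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5},
    'Toby': {'Snakes on a Plane': 4.5, 'Superman Returns': 4.0},
}

# Update an existing rating.
critics['Toby']['Snakes on a Plane'] = 4.0

# Add a rating for a movie Toby had not rated (illustrative value only).
critics['Toby']['Lady in the Water'] = 3.0

# Look up a rating that may be missing without raising KeyError.
missing = critics['Lisa Rose'].get('Superman Returns')

# List the movies that both critics have rated.
shared = sorted(set(critics['Lisa Rose']) & set(critics['Toby']))

print(missing)
print(shared)
```

Using `dict.get` returns `None` for a movie the critic has not rated, which is often more convenient than catching `KeyError`.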

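Finding Similar Users

Finding similar users means determining how alike people are in their tastes. This requires comparing each person with every other person and computing a similarity score. Two measures are used here to compute that score: Euclidean distance and Pearson correlation.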
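
The all-pairs comparison described above could be organized as a ranking helper along the following lines. This is only a sketch: `top_matches`, `overlap_similarity`, and the toy data are hypothetical names introduced for illustration, and the overlap metric is a placeholder rather than one of the two measures used in this article.

```python
# Hypothetical helper, not part of the original recommendations.py.
def top_matches(prefs, person, similarity, n=3):
    """Rank every other person by similarity to `person`, best first."""
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    scores.sort(reverse=True)
    return scores[:n]


def overlap_similarity(prefs, p1, p2):
    """Placeholder metric: fraction of p1's items that p2 also rated."""
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    return len(shared) / len(prefs[p1])


# Toy data, invented for this example.
toy_prefs = {
    'A': {'x': 4.0, 'y': 2.0},
    'B': {'x': 5.0, 'y': 1.0},
    'C': {'x': 3.0},
    'D': {'z': 3.0},
}

ranked = top_matches(toy_prefs, 'A', overlap_similarity)
print(ranked)
```

Because the similarity function is passed in as an argument, any function with the same `(prefs, person1, person2)` signature could be substituted for the placeholder.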

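Euclidean Distance Score

Euclidean distance is the distance between two points in a multidimensional space; here each movie that both people have rated is one dimension, and the distance measures how similar the two people are. The smaller the distance, the higher the similarity.
Euclidean distance formula: $dist(X,Y) = \sqrt{\sum_{i=1}^n (x_i-y_i)^2}$
Implementation: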

from math import sqrt

# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs,person1,person2):
  # Get the list of shared_items
  si={}
  for item in prefs[person1]: 
    if item in prefs[person2]: si[item]=1

  # if they have no ratings in common, return 0
  if len(si)==0: return 0

  # Add up the squares of all the differences
  sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) 
                      for item in prefs[person1] if item in prefs[person2]])

  return 1/(1+sum_of_squares)

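This function returns a value between 0 and 1, where 1 means the two people gave identical ratings to every movie they share. Note that it works from the sum of squared differences without taking the square root, adds 1 to avoid division by zero, and inverts the result so that a larger value means greater similarity. Call it with two names to compute their similarity score: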

>>> import recommendations
>>> recommendations.sim_distance(recommendations.critics,'Lisa Rose','Gene Seymour')
0.14814814814814814
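
To make the arithmetic behind that result visible, the sketch below recomputes the score for Lisa Rose and Gene Seymour step by step from their ratings in the dataset. The variable names are illustrative. It also shows, for comparison only, what the score would be if the square root from the distance formula were applied; that variant is an assumption of this example and is not what `sim_distance` does.

```python
from math import sqrt

# Ratings copied from the critics dictionary.
lisa = {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
        'Just My Luck': 3.0, 'Superman Returns': 3.5,
        'You, Me and Dupree': 2.5, 'The Night Listener': 3.0}
gene = {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
        'Just My Luck': 1.5, 'Superman Returns': 5.0,
        'The Night Listener': 3.0, 'You, Me and Dupree': 3.5}

# Squared difference for every movie both critics rated.
squared_diffs = {film: (lisa[film] - gene[film]) ** 2
                 for film in lisa if film in gene}
sum_of_squares = sum(squared_diffs.values())

# Score as computed by sim_distance: no square root is taken.
score_without_root = 1 / (1 + sum_of_squares)

# Variant that applies the square root from the distance formula.
score_with_root = 1 / (1 + sqrt(sum_of_squares))

print(sum_of_squares, score_without_root, score_with_root)
```

Both variants preserve the ordering of critics by similarity, since the square root is monotonic; only the scale of the scores differs.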

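Pearson Correlation Score

The Pearson correlation coefficient measures how well two sets of data fit a straight line. It corrects for grade inflation, so it tends to give better results when the data is not well normalized (for example, when a critic's ratings are consistently far above or below the average). The larger the coefficient, the higher the similarity.

Pearson correlation formula: $r(X,Y) = \dfrac{\sum XY - \dfrac{\sum X \sum Y}{N}}{\sqrt{\left(\sum X^2 - \dfrac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \dfrac{(\sum Y)^2}{N}\right)}}$
Implementation: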

# Returns the Pearson correlation coefficient for p1 and p2
def sim_pearson(prefs,p1,p2):
  # Get the list of mutually rated items
  si={}
  for item in prefs[p1]: 
    if item in prefs[p2]: si[item]=1

  # if there are no ratings in common, return 0
  if len(si)==0: return 0

  # Sum calculations
  n=len(si)

  # Sums of all the preferences
  sum1=sum([prefs[p1][it] for it in si])
  sum2=sum([prefs[p2][it] for it in si])

  # Sums of the squares
  sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
  sum2Sq=sum([pow(prefs[p2][it],2) for it in si])   

  # Sum of the products
  pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])

  # Calculate r (Pearson score)
  num=pSum-(sum1*sum2/n)
  den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
  if den==0: return 0

  r=num/den

  return r

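This function returns a value between -1 and 1, where 1 means the two people's ratings of their shared movies lie exactly on a rising straight line. Call it with two names to compute their similarity score: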

>>> import recommendations
>>> recommendations.sim_pearson(recommendations.critics,'Lisa Rose','Gene Seymour')
0.39605901719066977
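
The grade inflation point can be illustrated with a small, made-up example: two critics who agree perfectly on the relative order of four movies, except that one always scores exactly one point higher. The helper functions below are simplified stand-ins that take plain lists instead of the `prefs` dictionary, and the ratings are invented for the demonstration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of ratings."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x_sq = sum(x * x for x in xs)
    sum_y_sq = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    num = sum_xy - sum_x * sum_y / n
    den = sqrt((sum_x_sq - sum_x ** 2 / n) * (sum_y_sq - sum_y ** 2 / n))
    return num / den if den else 0


def distance_score(xs, ys):
    """Inverted sum of squared differences, as in sim_distance."""
    return 1 / (1 + sum((x - y) ** 2 for x, y in zip(xs, ys)))


# Invented ratings: the second critic is always one point more generous.
strict = [1.0, 2.0, 3.0, 4.0]
generous = [2.0, 3.0, 4.0, 5.0]

r = pearson(strict, generous)
d = distance_score(strict, generous)
print(r, d)
```

The Pearson score treats the two critics as perfectly correlated, while the distance-based score penalizes the constant one-point offset.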

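Recommending Based on Users

Recommending Movies

Each movie is scored with a weighted rating: every other critic's rating is weighted by that critic's similarity to the target person, the weighted ratings are summed and normalized, and the sorted list is recommended to that person.

Final score: $r = \dfrac{\sum (\text{rating} \times \text{similarity})}{\sum \text{similarity}}$, where both sums run over the critics who actually rated the movie.

Table 1-1 below walks through the calculation: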

Critic	Similarity	Night	Sim. x Night	Lady	Sim. x Lady	Luck	Sim. x Luck
Rose	0.99	3.0	2.97	2.5	2.48	3.0	2.97
Seymour	0.38	3.0	1.14	3.0	1.14	1.5	0.57
Puig	0.89	4.5	4.02	-	-	3.0	2.68
LaSalle	0.92	3.0	2.77	3.0	2.77	2.0	1.85
Matthews	0.66	3.0	1.99	3.0	1.99	-	-
Total			12.89		8.38		8.07
Sim. Sum			3.84		2.95		3.18
Total / Sim. Sum			3.35		2.83		2.53
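
As a quick check of the procedure, the sketch below recomputes the "Night" column of the table above from the similarity and rating columns. It uses the two-decimal similarities printed in the table, so the intermediate total differs slightly from the table's 12.89, which was presumably computed from unrounded similarities; the final predicted rating agrees to within rounding.

```python
# (critic, similarity to Toby, rating of 'The Night Listener'),
# taken from the table above.
rows = [
    ('Rose', 0.99, 3.0),
    ('Seymour', 0.38, 3.0),
    ('Puig', 0.89, 4.5),
    ('LaSalle', 0.92, 3.0),
    ('Matthews', 0.66, 3.0),
]

# Numerator: each rating weighted by the critic's similarity.
weighted_total = sum(sim * rating for _, sim, rating in rows)

# Denominator: similarities of the critics who rated the movie.
sim_sum = sum(sim for _, sim, _ in rows)

# Predicted rating for Toby.
predicted = weighted_total / sim_sum
print(round(weighted_total, 2), round(sim_sum, 2), round(predicted, 2))
```

Dividing by the sum of similarities keeps a movie reviewed by many critics from scoring higher merely because more terms were added to the total.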

Implementation:

# Gets recommendations for a person by using a weighted average
# of every other user's rankings
def getRecommendations(prefs,person,similarity=sim_pearson):
  totals={}
  simSums={}
  for other in prefs:
    # don't compare me to myself
    if other==person: continue
    sim=similarity(prefs,person,other)

    # ignore scores of zero or lower
    if sim<=0: continue
    for item in prefs[other]:

      # only score movies I haven't seen yet
      if item not in prefs[person] or prefs[person][item]==0:
        # Similarity * Score
        totals.setdefault(item,0)
        totals[item]+=prefs[other][item]*sim
        # Sum of similarities
        simSums.setdefault(item,0)
        simSums[item]+=sim

  # Create the normalized list
  rankings=[(total/simSums[item],item) for item,total in totals.items()]

  # Return the sorted list
  rankings.sort()
  rankings.reverse()
  return rankings

Sorting the results gives a ranked list of movies, together with a prediction of how the person would rate each one:

>>> import recommendations
>>> recommendations.getRecommendations(recommendations.critics,'Toby')
[(3.3477895267131017, 'The Night Listener'), (2.8325499182641614, 'Lady in the Water'), (2.530980703765565, 'Just My Luck')]

>>> recommendations.getRecommendations(recommendations.critics,'Toby',
... similarity=recommendations.sim_distance)
[(3.5002478401415877, 'The Night Listener'), (2.7561242939959363, 'Lady in the Water'), (2.461988486074374, 'Just My Luck')]

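On this dataset, the choice of similarity measure barely changes the result: the ranking is identical and the predicted ratings differ by less than 0.2.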

Recommending Based on Items

Choosing Between the Two Approaches

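Item-based filtering produces more personalized recommendations that reflect the continuity of a user's own interests. It is more accurate on sparse datasets and noticeably faster at generating recommendations on large datasets, although it carries the extra overhead of maintaining the item similarities.
User-based filtering, on the other hand, is easier to implement. Its recommendations emphasize what is popular among the small group of users whose interests resemble the user's, and emphasize staying with the user's past interests. It is better suited to smaller in-memory datasets that change very frequently, or to cases where you need to recommend users with similar preferences to a given user.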
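
To give a sense of what maintaining item similarities might involve, here is a minimal sketch under stated assumptions: the preference dictionary is inverted so that movies become the keys, and a similarity table between movies is computed ahead of time. The names `transform_prefs`, `distance_score`, and `build_item_similarities` are hypothetical and do not appear in the code above, and the three-critic sample is an excerpt of the dataset.

```python
def transform_prefs(prefs):
    """Invert {person: {item: rating}} into {item: {person: rating}}."""
    result = {}
    for person, ratings in prefs.items():
        for item, rating in ratings.items():
            result.setdefault(item, {})[person] = rating
    return result


def distance_score(prefs, a, b):
    """Inverted sum of squared differences over the shared keys."""
    shared = [k for k in prefs[a] if k in prefs[b]]
    if not shared:
        return 0
    return 1 / (1 + sum((prefs[a][k] - prefs[b][k]) ** 2 for k in shared))


def build_item_similarities(prefs, similarity=distance_score):
    """Precompute, for every item, the other items ranked by similarity."""
    item_prefs = transform_prefs(prefs)
    table = {}
    for item in item_prefs:
        scores = [(similarity(item_prefs, item, other), other)
                  for other in item_prefs if other != item]
        scores.sort(reverse=True)
        table[item] = scores
    return table


# Excerpt of the critics dataset.
sample = {
    'Lisa Rose': {'Snakes on a Plane': 3.5, 'Superman Returns': 3.5,
                  'You, Me and Dupree': 2.5},
    'Gene Seymour': {'Snakes on a Plane': 3.5, 'Superman Returns': 5.0,
                     'You, Me and Dupree': 3.5},
    'Toby': {'Snakes on a Plane': 4.5, 'Superman Returns': 4.0,
             'You, Me and Dupree': 1.0},
}

item_sims = build_item_similarities(sample)
print(item_sims['Snakes on a Plane'])
```

The table has to be rebuilt or updated as ratings change, which is the maintenance cost; in exchange, generating recommendations for a user only needs lookups in the precomputed table.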