Neighborhood-based recommendation algorithms: collaborative filtering

Neighborhood-based collaborative filtering falls into two categories: user-based collaborative filtering and item-based collaborative filtering. The former recommends to a user the items liked by other users with similar interests; the latter recommends items similar to the ones the user liked before.

User-based collaborative filtering

User-based collaborative filtering can, by definition, be divided into two steps:

  1. Find the set of users whose interests are similar to the target user's
  2. Find the items that users in this set like and that the target user has not yet heard of, and recommend them to the target user

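The basic formulas for computing user similarity, where N(u) denotes the set of items user u has interacted with:

(1) Jaccard formula:

$$w_{uv} = \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|}$$

(2) Cosine similarity:

$$w_{uv} = \frac{|N(u) \cap N(v)|}{\sqrt{|N(u)|\,|N(v)|}}$$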
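The two similarity measures can be sketched on plain Python sets. This is a minimal illustration, assuming each user is represented only by the set of items he has interacted with (implicit feedback); the function names and the toy data are hypothetical.

```python
import math

def jaccard_similarity(items_u, items_v):
    """Jaccard similarity: shared items divided by all items of both users."""
    items_u, items_v = set(items_u), set(items_v)
    union = items_u | items_v
    if not union:
        return 0.0
    return len(items_u & items_v) / len(union)

def cosine_similarity(items_u, items_v):
    """Set form of cosine similarity: shared items over sqrt(|N(u)| * |N(v)|)."""
    items_u, items_v = set(items_u), set(items_v)
    if not items_u or not items_v:
        return 0.0
    return len(items_u & items_v) / math.sqrt(len(items_u) * len(items_v))

# Hypothetical toy data: the item sets of two users.
items_a = {"a", "b", "d"}
items_b = {"a", "c"}
print(jaccard_similarity(items_a, items_b))
print(cosine_similarity(items_a, items_b))
```

Both measures are 0 when the users share no items; they differ only in how the overlap is normalized.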

After obtaining the interest similarity between users, the UserCF algorithm recommends to a user the items liked by the K users whose interests are most similar to his. The following formula measures user u's interest in item i:

$$p(u, i) = \sum_{v \in S(u, K) \cap N(i)} w_{uv} r_{vi}$$

where S(u, K) contains the K users whose interests are closest to those of user u, N(i) is the set of users who have interacted with item i, w_{uv} is the interest similarity between user u and user v, and r_{vi} represents user v's interest in item i; here r_{vi} = 1.
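The scoring step can be sketched as follows. This is only an illustration under the implicit-feedback assumption r_vi = 1; the function name, the dictionary layout and the similarity values are hypothetical.

```python
def user_cf_interest(user, item, similarity, user_items, k=3):
    """Sum the similarity of the K nearest neighbours of `user` who interacted with `item`."""
    # S(u, K): the K users most similar to `user`
    neighbours = sorted(similarity[user].items(),
                        key=lambda pair: pair[1], reverse=True)[:k]
    # Only neighbours that belong to N(i) contribute, each with r_vi = 1
    return sum(w_uv for v, w_uv in neighbours if item in user_items[v])

# Hypothetical similarities of user A to the other users, and their item sets.
similarity = {"A": {"B": 0.5, "C": 0.4, "D": 0.3, "E": 0.1}}
user_items = {
    "B": {"a", "c"},
    "C": {"b", "e"},
    "D": {"c", "d", "e"},
    "E": {"c"},
}
print(user_cf_interest("A", "c", similarity, user_items, k=3))
print(user_cf_interest("A", "e", similarity, user_items, k=3))
```

User E also interacted with item c, but is not among the 3 nearest neighbours of A and therefore does not contribute to the score.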

An item-to-user inverted table can be built: for each item, it stores the list of users who have interacted with that item.
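One way to use such an inverted table is sketched below: only user pairs that share at least one item are ever visited, so pairs with nothing in common cost no work. The function name and the toy data are hypothetical, and cosine similarity is assumed as the measure.

```python
import math
from collections import defaultdict

def user_similarity(user_items):
    """Cosine similarity between users, computed through an item -> users table."""
    # Build the inverted table: item -> set of users who interacted with it
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)

    # common[u][v] = number of items shared by u and v
    common = defaultdict(lambda: defaultdict(int))
    for users in item_users.values():
        for u in users:
            for v in users:
                if u != v:
                    common[u][v] += 1

    # Normalize the counts into cosine similarities
    return {
        u: {v: c / math.sqrt(len(user_items[u]) * len(user_items[v]))
            for v, c in related.items()}
        for u, related in common.items()
    }

# Hypothetical toy data: user -> set of items.
toy_user_items = {
    "A": {"a", "b", "d"},
    "B": {"a", "c"},
    "C": {"b", "e"},
    "D": {"c", "d", "e"},
}
W_users = user_similarity(toy_user_items)
print(W_users["A"])
```

Users B and C share no item, so no entry is ever created for that pair.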

Take recommending for user A in the figure above as an example, with K = 3. User A has not interacted with items c and e, so these two items can be recommended to user A. User A's interest in items c and e is then computed with the formula above.

Improvement:

The algorithm above has a problem. For example, if two people have both bought the "Xinhua Dictionary", this does not mean that they have similar interests, because most people have bought that book. If two users have both bought "Introduction to Data Mining", however, their interests can be considered quite similar, because only people who study data mining buy that book. In other words, two users showing the same behavior on unpopular items says more about the similarity of their interests, so the similarity measure becomes:

$$w_{uv} = \frac{\sum_{i \in N(u) \cap N(v)} \frac{1}{\log(1 + |N(i)|)}}{\sqrt{|N(u)|\,|N(v)|}}$$
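A minimal sketch of this idea, assuming each shared item i is weighted by 1 / log(1 + |N(i)|), where |N(i)| is the number of users who interacted with the item. The function name and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def user_similarity_iif(user_items):
    """User similarity in which popular items contribute less than rare ones."""
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)

    weighted = defaultdict(lambda: defaultdict(float))
    for users in item_users.values():
        # The more users an item has, the smaller its contribution
        penalty = 1.0 / math.log(1 + len(users))
        for u in users:
            for v in users:
                if u != v:
                    weighted[u][v] += penalty

    return {
        u: {v: s / math.sqrt(len(user_items[u]) * len(user_items[v]))
            for v, s in related.items()}
        for u, related in weighted.items()
    }

# Hypothetical toy data: everyone owns the dictionary, only A and B own the
# data mining book.
purchases = {
    "A": {"dictionary", "data_mining"},
    "B": {"dictionary", "data_mining"},
    "C": {"dictionary", "novel"},
}
W_iif = user_similarity_iif(purchases)
print(W_iif["A"]["B"], W_iif["A"]["C"])
```

In this toy data the shared dictionary adds 1 / log(4) to each pair, while the rarer data mining book adds the larger 1 / log(3) to the pair A and B only.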

 

Item-based collaborative filtering

Item-based collaborative filtering consists of two steps:

  1. Compute the similarity between items
  2. Generate a recommendation list for the user based on the item similarities and the user's historical behavior

Item similarity calculation:

$$w_{ij} = \frac{|N(i) \cap N(j)|}{\sqrt{|N(i)|\,|N(j)|}}$$

|N(i)|: the number of users who like item i; |N(i) ∩ N(j)|: the number of users who like both item i and item j.

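As with UserCF, when computing item similarity with ItemCF one can first build a user-to-item inverted table (that is, for each user, a list of the items he likes). Then, for each user, 1 is added to the co-occurrence matrix C for every pair of items in his list. Summing these per-user matrices gives the final matrix C, where C[i][j] records the number of users who like both item i and item j. Finally, C is normalized to obtain the cosine similarity matrix W between items.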
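The co-occurrence procedure can be sketched as follows, with nested dictionaries standing in for the matrices C and W. The function name and the toy data are hypothetical.

```python
import math
from collections import defaultdict
from itertools import permutations

def item_similarity(user_items):
    """Cosine similarity between items, from a user -> items table."""
    item_count = defaultdict(int)                 # |N(i)|: users who like item i
    co = defaultdict(lambda: defaultdict(int))    # C[i][j]: users who like both
    for items in user_items.values():
        for i in items:
            item_count[i] += 1
        # Every ordered pair of items in one user's list co-occurs once
        for i, j in permutations(items, 2):
            co[i][j] += 1

    # Normalize C into the cosine similarity matrix W
    return {
        i: {j: c / math.sqrt(item_count[i] * item_count[j])
            for j, c in related.items()}
        for i, related in co.items()
    }

# Hypothetical toy data: user -> set of liked items.
likes = {
    "A": {"a", "b", "d"},
    "B": {"b", "c", "e"},
    "C": {"c", "d"},
    "D": {"b", "c", "d"},
    "E": {"a", "d"},
}
W_items = item_similarity(likes)
print(W_items["a"])
```

Items a and d are liked together by users A and E, so C[a][d] = 2 and the similarity is 2 / sqrt(2 * 4).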

 

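After obtaining the item similarities, ItemCF computes user u's interest in an item j with the following formula:

$$p_{uj} = \sum_{i \in N(u) \cap S(j, K)} w_{ji} r_{ui}$$

N(u) is the set of items the user likes, S(j, K) is the set of the K items most similar to item j, w_{ji} is the similarity between items j and i, and r_{ui} is user u's interest in item i. For implicit-feedback datasets, r_{ui} can be set to 1 whenever user u has interacted with item i. The formula means that the more similar an item is to the items the user was interested in historically, the more likely it is to rank high in the user's recommendation list.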
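The scoring rule can be sketched as below, assuming implicit feedback (r_ui = 1) and that the neighbourhood is the set of K items most similar to the candidate item j. The function name, the dictionary layout and the similarity values are hypothetical.

```python
def item_cf_interest(user_history, item_j, similarity, k=3):
    """Sum w_ji over the K items most similar to item_j that the user already likes."""
    # The K items most similar to the candidate item j
    nearest = sorted(similarity.get(item_j, {}).items(),
                     key=lambda pair: pair[1], reverse=True)[:k]
    # Only neighbours found in the user's history contribute, each with r_ui = 1
    return sum(w_ji for i, w_ji in nearest if i in user_history)

# Hypothetical similarities of candidate item c to other items.
item_sim = {"c": {"a": 0.2, "b": 0.6, "d": 0.5, "e": 0.4}}
history = {"a", "b"}          # items the user already likes
print(item_cf_interest(history, "c", item_sim, k=3))
print(item_cf_interest(history, "c", item_sim, k=4))
```

With k=3 only item b is both in the history and among the nearest neighbours of c; raising k to 4 lets the weakly similar item a contribute as well.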

Effect of user activity on item similarity

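Besides the weighting analysis above, one can also take into account the effect of user activity on item similarity, known as IUF (Inverse User Frequency): an active user should contribute less to item similarity than an inactive one. An IUF term is therefore added to correct the item similarity formula:

$$w_{ij} = \frac{\sum_{u \in N(i) \cap N(j)} \frac{1}{\log(1 + |N(u)|)}}{\sqrt{|N(i)|\,|N(j)|}}$$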
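A minimal sketch of an IUF-corrected item similarity, assuming each user's contribution to a co-occurrence is 1 / log(1 + |N(u)|), where |N(u)| is the number of items the user likes. The function name and the toy data are hypothetical.

```python
import math
from collections import defaultdict
from itertools import permutations

def item_similarity_iuf(user_items):
    """Item similarity in which very active users contribute less."""
    item_count = defaultdict(int)
    co = defaultdict(lambda: defaultdict(float))
    for items in user_items.values():
        # The longer the user's list, the smaller the weight of each pair in it
        weight = 1.0 / math.log(1 + len(items))
        for i in items:
            item_count[i] += 1
        for i, j in permutations(items, 2):
            co[i][j] += weight

    return {
        i: {j: c / math.sqrt(item_count[i] * item_count[j])
            for j, c in related.items()}
        for i, related in co.items()
    }

# Hypothetical toy data: one very active user and one casual user.
activity = {
    "heavy": {"a", "b", "c", "d", "e", "f"},
    "light": {"a", "b"},
}
W_iuf = item_similarity_iuf(activity)
print(W_iuf["a"]["b"], W_iuf["c"]["d"])
```

Here the casual user adds 1 / log(3) to the pair (a, b), while the very active user adds only 1 / log(7) to every pair in his list.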

  

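Item similarity normalization

If the item similarity matrix w has already been obtained, the normalized similarity matrix w' can be computed as:

$$w'_{ij} = \frac{w_{ij}}{\max_{j} w_{ij}}$$

Normalization not only increases recommendation accuracy, it also improves coverage and diversity.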
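The normalization step can be sketched as a row-wise division by the largest similarity in each row. This assumes the similarity matrix is stored as a nested dictionary; the function name and the values are hypothetical.

```python
def normalize_similarity(similarity):
    """Divide every row of the similarity matrix by its largest entry."""
    normalized = {}
    for i, related in similarity.items():
        max_w = max(related.values(), default=0.0)
        normalized[i] = {
            j: (w / max_w if max_w > 0 else 0.0)
            for j, w in related.items()
        }
    return normalized

# Hypothetical item similarity matrix.
raw_w = {
    "a": {"b": 0.2, "c": 0.4},
    "b": {"a": 0.2, "c": 0.8},
}
print(normalize_similarity(raw_w))
```

After normalization the most similar neighbour of every item has weight 1, so items from groups with generally low similarities are no longer drowned out by groups with generally high ones.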

 

Implementation:

import math
import time
import pandas as pd

def calcuteSimilar(series1,series2):
    '''
    Compute the cosine similarity between two users' item lists
    :param series1: dataset 1, Series of item IDs
    :param series2: dataset 2, Series of item IDs
    :return: similarity
    '''
    commonLen = len(set(series1) & set(series2))                                        # number of items both users share
    if commonLen == 0: return 0.0
    product = len(series1) * len(series2)
    similarity = commonLen / math.sqrt(product)
    return similarity

def calcuteUser(csvpath,targetID=1,TopN=10):
    '''
    Compute the similarity between the user targetID and every other user
    :return: Series with the TopN most similar users and their similarities
    '''
    frame = pd.read_csv(csvpath)                                                        # read the data
    targetUser = frame[frame['UserID'] == targetID]['MovieID']                          # items of the target user
    otherUsersID = [i for i in set(frame['UserID']) if i != targetID]                   # IDs of the other users
    otherUsers = [frame[frame['UserID'] == i]['MovieID'] for i in otherUsersID]         # items of the other users
    similarlist = [calcuteSimilar(targetUser,user) for user in otherUsers]              # similarity to each other user
    similarSeries = pd.Series(similarlist,index=otherUsersID)                           # Series indexed by user ID
    return similarSeries.sort_values()[-TopN:]

def calcuteInterest(frame,similarSeries,targetItemID):
    '''
    Compute the target user's interest in the target item
    :param frame: the data
    :param similarSeries: the K users most similar to the target user
    :param targetItemID: the target item
    :return: interest
    '''
    similarUserID = similarSeries.index                                                 # the K users with the most similar interests
    similarUsers = [frame[frame['UserID'] == i] for i in similarUserID]                 # data of those K users
    similarUserValues = similarSeries.values                                            # similarity between the target user and each of them
    UserInstItem = []
    for u in similarUsers:                                                              # each neighbour's interest in the item (0 if not rated)
        if targetItemID in u['MovieID'].values: UserInstItem.append(u[u['MovieID']==targetItemID]['Rating'].values[0])
        else: UserInstItem.append(0)
    # weight each neighbour's rating (divided by 5) by that neighbour's similarity
    interest = sum([similarUserValues[v]*UserInstItem[v]/5 for v in range(len(similarUserValues))])
    return interest

def calcuteItem(csvpath,targetUserID=1,TopN=10):
    '''
    Compute the TopN items to recommend to the user targetUserID
    :param csvpath: path to the data
    :param targetUserID: the target user
    :param TopN: number of items to return
    :return: the TopN items and the user's interest in each
    '''
    frame = pd.read_csv(csvpath)                                                        # read the data
    similarSeries = calcuteUser(csvpath=csvpath, targetID=targetUserID)                 # the K most similar users
    userMovieID = set(frame[frame['UserID'] == targetUserID]['MovieID'])                # items the target user already has
    otherMovieID = set(frame[frame['UserID'] != targetUserID]['MovieID'])               # items the other users have
    movieID = list(otherMovieID - userMovieID)                                          # set difference: items new to the target user
    interestList = [calcuteInterest(frame,similarSeries,movie) for movie in movieID]    # interest in each candidate item
    interestSeries = pd.Series(interestList, index=movieID)
    return interestSeries.sort_values()[-TopN:]                                         # TopN

if __name__ == '__main__':
    print('start..')
    start = time.time()
    a = calcuteItem('ratings.csv')
    print(a)
    print('Cost time: %f'%(time.time()-start))

Origin www.cnblogs.com/cmybky/p/11776390.html