Recommender System Study Notes 01: Item-Based Collaborative Filtering for Song Recommendation

I recently needed a recommender system at work, so I surveyed the topic and am writing up these study notes for future reference.

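I. Overview
Collaborative filtering is the technique that comes up most often when recommender systems are discussed.
It can be divided into three categories:
user-based, item-based, and model-based.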

This post is only a first-pass reference for building an item-based recommender.

II. Project Background and Goal
Recommend suitable songs to a given user based on that user's listening history.

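III. Data
User listening history: train_triplets.txt, 48,373,586 records, each with three fields: 'user_id', 'song_id', and 'play_count'.

Song metadata: track_metadata.db, with information on about one million songs; each song has fields such as 'track_id', 'title', 'song_id', 'artist', and 'release'.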

IV. Data Exploration (the exploration itself is omitted here)
Notes:
(1) Reading the .txt file

import pandas as pd

# data_home is the directory that holds the data files (defined elsewhere)
triplet_dataset = pd.read_csv(filepath_or_buffer=data_home+'train_triplets.txt',
                             sep='\t',header=None,
                             names=['user','song','play_count'])
triplet_dataset.shape

(48373586, 3)

From here on, the DataFrame is handled the same way as one loaded from a CSV file.

(2) Handling the .db file

import sqlite3

# List the tables in the SQLite database
conn = sqlite3.connect(data_home+'track_metadata.db')
cur = conn.cursor()
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
cur.fetchall()

[('songs',)]

track_metadata_df = pd.read_sql(con=conn,sql='select * from songs')

From here on, the DataFrame is handled the same way as one loaded from a CSV file.

(3) To keep the computation manageable, only a small subset of the data is used for modeling.
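
The post does not say how the subset was chosen. One common approach, shown here purely as an assumption, is to keep only the most active users and the most played songs. The helper name `take_subset` and the cut-off values are hypothetical; the column names follow the `read_csv` call above.

```python
import pandas as pd

def take_subset(triplets, n_users, n_songs):
    """Keep rows whose user and song are both among the most played."""
    # Total play count per user and per song
    user_totals = triplets.groupby('user')['play_count'].sum()
    song_totals = triplets.groupby('song')['play_count'].sum()

    # Labels of the top-n users and songs by total play count
    top_users = user_totals.nlargest(n_users).index
    top_songs = song_totals.nlargest(n_songs).index

    mask = triplets['user'].isin(top_users) & triplets['song'].isin(top_songs)
    return triplets[mask].reset_index(drop=True)

# Tiny made-up data set, only to show the call
toy = pd.DataFrame({
    'user': ['u1', 'u1', 'u2', 'u2', 'u3'],
    'song': ['s1', 's2', 's1', 's3', 's3'],
    'play_count': [5, 1, 3, 2, 1],
})
subset = take_subset(toy, n_users=2, n_songs=2)
print(subset)
```

On the real data the cut-offs would be far larger; they trade coverage against the cost of the pairwise similarity computation described later.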

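V. Model Building (Item-Based Collaborative Filtering)
(1) For the target user user_id, find all songs the user has listened to, user_songs, from the listening history (train_data.csv).
(2) For each of those songs, find all of its listeners, user_songs_listeners.
(3) Find all songs in the training data train_data, all_songs, and the listeners of each one, all_songs_listeners.
(4) Compute the similarity matrix:
Use user_songs_listeners and all_songs_listeners to compute the similarity between user_songs and all_songs.
(5) Use the similarity matrix to get a score for every song in all_songs, sort by score, and recommend the top K songs that are not already in user_songs to the target user.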
[Figure: hand-written walkthrough of the matrix calculation]
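
The five steps can also be traced on a toy example. The sketch below is not the module's code: the listening data, the `jaccard` helper and `recommend_toy` are made up for illustration, and plain Python sets stand in for the DataFrame lookups.

```python
# Hypothetical listening history: song -> set of listeners
listeners = {
    'A': {'u1', 'u2'},
    'B': {'u1', 'u2', 'u3'},
    'C': {'u2', 'u3'},
    'D': {'u4'},
}

def jaccard(a, b):
    """Jaccard index of two sets; 0.0 when the union is empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_toy(user, listeners, k=10):
    # Step 1: songs the target user has listened to
    user_songs = [s for s, users in listeners.items() if user in users]
    if not user_songs:
        return []  # nothing to compare against
    scores = {}
    for song, song_users in listeners.items():
        if song in user_songs:
            continue  # never recommend a song the user already heard
        # Steps 2-4: similarity of this candidate to each of the user's songs
        sims = [jaccard(listeners[s], song_users) for s in user_songs]
        # Step 5: score = mean similarity over the user's songs
        scores[song] = sum(sims) / len(sims)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

result = recommend_toy('u1', listeners)
print(result)
```

For user u1 (songs A and B), song C shares listeners with both and ranks first, while song D shares none and scores zero.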
VI. Code
1. Write the module RecommendersJsq.py
Usage:

import RecommendersJsq as RecommendersJsq

is_model = RecommendersJsq.item_similarity_recommender_py()
is_model.creat(train_data,'user','title')

user_id = list(train_data.user)[7]
recommendations = is_model.recommend(user_id)
print(recommendations)

This produces the following output:
[Figure: screenshot of the recommendation output]
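
The module's constructor and its `creat` method are not listed in this post. Judging from how the other methods read `self.train_data`, `self.user_id` and `self.item_id`, they presumably just store the DataFrame and the two column names. The skeleton below is a guess along those lines; the class name `ItemSimilaritySketch` is hypothetical and is not the author's class.

```python
import pandas as pd

class ItemSimilaritySketch:
    """Hypothetical stand-in for item_similarity_recommender_py."""

    def __init__(self):
        self.train_data = None
        self.user_id = None   # name of the user column
        self.item_id = None   # name of the item column

    def creat(self, train_data, user_id, item_id):
        # Only store references; similarities are computed at recommend time
        self.train_data = train_data
        self.user_id = user_id
        self.item_id = item_id

    def get_user_items(self, user_id):
        # Mirrors the get_user_items method listed in this post
        rows = self.train_data[self.train_data[self.user_id] == user_id]
        return list(rows[self.item_id].unique())

# Made-up data, only to show the call sequence
toy = pd.DataFrame({'user': ['u1', 'u1', 'u2'], 'title': ['x', 'y', 'x']})
model = ItemSimilaritySketch()
model.creat(toy, 'user', 'title')
print(model.get_user_items('u1'))
```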

2. The main methods of RecommendersJsq.py are as follows:
(1) recommend()

    def recommend(self,user_id):
        ################################
        #A get the all songs of the user
        ###############################
        user_songs = self.get_user_items(user_id)
        print('by JSQ No. of unique songs for the user: %d' % len(user_songs))

        #################################
        #B get all songs from train_data
        ################################
        all_songs = self.get_all_items_train_data()
        print('by JSQ No. of unique songs in the training set: %d' % len(all_songs))

        ###############################################
        # C. Construct item cooccurence matrix of size
        # len(user_songs) X len(songs)
        ###############################################
        cooccurence_matrix = self.get_cooccurence_matrix(user_songs,all_songs)

        #######################################################
        #D. Use the cooccurence matrix to make recommendations
        #######################################################
        df_recommendations = self.get_top_recommendations(user_id,cooccurence_matrix,all_songs,user_songs)

        return df_recommendations

(2) get_user_items()
For the target user user_id, find all songs the user has listened to, user_songs, from the listening history (train_data.csv).

    def get_user_items(self,user_id):
        user_items_data = self.train_data[self.train_data[self.user_id] == user_id]
        user_items = list((user_items_data[self.item_id]).unique())
        return user_items

(3) get_all_items_train_data()
Find all songs in the training data train_data, all_songs.

    # Get unique items (songs) in the training data
    def get_all_items_train_data(self):
        all_items = list(self.train_data[self.item_id].unique())
        return all_items

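(4) get_cooccurence_matrix()
Compute the similarity matrix (len(user_songs) X len(all_songs)).
Use user_songs_listeners and all_songs_listeners to compute the similarity between user_songs and all_songs.
Each entry = [size of the intersection of the listeners of the i-th song in user_songs and the listeners of the j-th song in all_songs] / [size of the union of the two].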
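
Written as a formula, this is the Jaccard index. The notation $U(s)$ for the set of listeners of song $s$ is introduced here only for illustration; $s_i$ is the i-th song in user_songs and $s_j$ the j-th song in all_songs.

```latex
\mathrm{sim}(i, j) = \frac{\lvert U(s_i) \cap U(s_j) \rvert}{\lvert U(s_i) \cup U(s_j) \rvert}
```

The value lies between 0 (no common listeners) and 1 (identical listener sets).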

    # Construct cooccurence matrix
    def get_cooccurence_matrix(self,user_songs,all_songs):

        ######################################################
        # get the listeners for each song of the user songs
        #####################################################
        user_songs_listerners = []
        for i in range(0,len(user_songs)):
            user_songs_listerners.append(self.get_item_users(user_songs[i]))

        ##############################################################
        #get the listeners for each song of all songs from train_data
        ##############################################################
        all_songs_listeners = []
        for i in range(len(all_songs)):
            all_songs_listeners.append(self.get_item_users(all_songs[i]))

        ###############################################
        # Initialize the item cooccurence matrix of size
        # len(user_songs) X len(all_songs)
        ##################################################
        cooccurence_matrix = np.matrix(np.zeros((len(user_songs),len(all_songs))))

        #############################################################
        # Calculate similarity between user songs and all unique songs
        # in the training data
        #############################################################
        for i in range(0,len(user_songs)):
            #Get unique listeners (users) of song (item) i
            user_i = user_songs_listerners[i]

            for j in range(0,len(all_songs)):
                # Get unique listeners (users) of song (item) j
                user_j = all_songs_listeners[j]

                # Calculate intersection of listeners of songs i and j
                interaction = user_i.intersection(user_j)

                # Calculate cooccurence_matrix[i,j] as Jaccard Index
                if len(interaction) != 0:
                    # Calculate union of listeners of songs i and j
                    union = user_i.union(user_j)
                    cooccurence_matrix[i,j] = float(len(interaction))/float(len(union))
                else:
                    cooccurence_matrix[i, j] = 0

        return cooccurence_matrix

This relies on get_item_users(), which finds all listeners of a given song:

    # Get unique users for a given item (song)
    def get_item_users(self,item_id):
        train_data_sub = self.train_data[self.train_data[self.item_id] == item_id]
        item_users = set(train_data_sub[self.user_id].unique())
        return item_users
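
Because get_item_users() filters the whole DataFrame once per song, the cost grows quickly with the number of songs. A possible alternative, which is not part of the original module, is to build every listener set in a single pass with a pandas groupby. The function name `build_listener_index` is hypothetical.

```python
import pandas as pd

def build_listener_index(train_data, user_col, item_col):
    """Map each item to the set of users who interacted with it."""
    # One pass over the data instead of one boolean filter per item
    return train_data.groupby(item_col)[user_col].apply(set).to_dict()

# Made-up data, only to show the call
toy = pd.DataFrame({
    'user': ['u1', 'u1', 'u2'],
    'title': ['x', 'y', 'x'],
})
index = build_listener_index(toy, 'user', 'title')
print(index)
```

The resulting dictionary could then replace the repeated get_item_users() calls when filling the similarity matrix.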

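(5) get_top_recommendations()
Use the similarity matrix to get a score for every song in all_songs, sort by score, and recommend the top K songs that are not already in user_songs to the target user.
Here score = (sum of each column of the matrix) / (number of rows in the matrix), i.e. the mean similarity between a candidate song and every song in user_songs. See the hand-written figure above for details.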

    def get_top_recommendations(self,user,cooccurence_matrix,all_songs,user_songs):
        non_zero = np.count_nonzero(cooccurence_matrix)
        print('No. the non zero is %d' %non_zero)

        # Calculate the plain (unweighted) mean of each column: the average similarity to all user songs.
        score = cooccurence_matrix.sum(axis=0)/float(cooccurence_matrix.shape[0])
        score = np.array(score)[0].tolist()

        #Sort the indices of scores based upon their value
        #Also maintain the corresponding score
        sorted_index = sorted(((e,i) for i,e in enumerate(score)),reverse=True)

        # Create a dataframe from the following
        column = ['user','song','score','rank']
        df = pd.DataFrame(columns=column)

        # Fill the dataframe with top 10 item based recommendations
        rank = 1
        for i in range(0,len(sorted_index)):
            if not np.isnan(sorted_index[i][0]) and all_songs[sorted_index[i][1]] not in user_songs and rank <= 10:
                df.loc[df.shape[0]] = [user,all_songs[sorted_index[i][1]],sorted_index[i][0],rank]
                rank += 1

        if df.shape[0] == 0:
            print('No songs could be recommended for user: %s' %user)
            return -1
        else:
            return df
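
The loop above appends rows to a DataFrame one at a time. As an alternative sketch, not the author's code, the same selection can be expressed with a pandas Series: take the column means, drop the songs the user already knows, and keep the k largest scores. The function name `top_k_from_matrix` is hypothetical, and ties may come out in a different order than in the original loop.

```python
import numpy as np
import pandas as pd

def top_k_from_matrix(user, sim_matrix, all_songs, user_songs, k=10):
    """Rank candidate songs by their mean similarity to the user's songs."""
    sim = np.asarray(sim_matrix, dtype=float)
    # Column mean = average similarity of each candidate to the user's songs
    scores = pd.Series(sim.mean(axis=0), index=all_songs)
    # Drop songs already heard and any NaN scores, then keep the k best
    scores = scores.drop(labels=user_songs, errors='ignore').dropna()
    top = scores.nlargest(k)
    return pd.DataFrame({
        'user': user,
        'song': top.index,
        'score': top.values,
        'rank': np.arange(1, len(top) + 1),
    })

# Made-up similarity matrix: rows = user songs A and B, columns = all songs
all_songs = ['A', 'B', 'C', 'D']
user_songs = ['A', 'B']
sim = [[1.0, 2 / 3, 1 / 3, 0.0],
       [2 / 3, 1.0, 2 / 3, 0.0]]
result = top_k_from_matrix('u1', sim, all_songs, user_songs, k=2)
print(result)
```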

Reposted from blog.csdn.net/weixin_43685844/article/details/104321094