KTV Song Recommendation: Collaborative Filtering in Depth

Foreword

There are many recommendation algorithms, and the most basic is collaborative filtering. Some time ago I became interested in KTV data: most people only sing the songs they already know, so is there a way to offer suggestions that broaden what they sing? A KTV recommender could take many factors into account, such as the singer's vocal range, age, region, and preferences. This first version of the algorithm makes recommendations from the item-based (ItemBase) perspective only. Since this is a personal-interest project, there is no feedback loop for iterating on the model; readers who are interested can implement one themselves.

Collaborative Filtering Algorithm

Collaborative filtering, also known as behavior-similarity recall, is essentially a similarity calculation based on co-occurrence. The ItemBase collaborative filtering algorithm rests on a few key concepts:

Similarity calculation

There are many ways to measure similarity: co-occurrence similarity, Euclidean distance, the Pearson correlation coefficient, and so on. This article uses co-occurrence similarity, with the following formula:

$$w_{ij} = \frac{|N(i) \cap N(j)|}{\sqrt{|N(i)|\,|N(j)|}}$$

Here N(i) is the set of users who like song i, N(j) is the set of users who like song j, and the numerator is the number of users who like both i and j. This is the improved form of the formula: |N(j)| is included in the denominator to penalize popular songs, which would otherwise look similar to everything. I won't go into details here.
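
As a small worked example of the co-occurrence similarity described above, the sketch below computes it directly from two sets of users. The function name `cooccurrence_similarity` and the sample user sets are hypothetical and only serve the illustration.

```python
import math

def cooccurrence_similarity(users_i, users_j):
    """Shared users divided by the geometric mean of the two audience sizes."""
    if not users_i or not users_j:
        return 0.0  # a song nobody likes is similar to nothing
    both = len(users_i & users_j)
    return both / math.sqrt(len(users_i) * len(users_j))

# Hypothetical data: users who like song i and song j
likes_i = {"A", "B", "C"}
likes_j = {"B", "C", "D", "E"}

sim = cooccurrence_similarity(likes_i, likes_j)
print(round(sim, 4))  # 2 / sqrt(3 * 4) = 0.5774
```

Two users are shared, so the numerator is 2; the denominator grows with the popularity of either song, which is what keeps very popular songs from dominating.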

ItemBase and UserBase

UserBase

UserBase finds users with similar interests and then recommends the songs those users like to the target user. In the table below, users A and C both like songs i and k, so the two users are similar, and song l, which user C likes, is recommended to user A. Expressed in terms of co-occurrence, A and C co-occur on songs i and k. A full calculation would also involve user scores and the sorting and aggregation of data from similar users; I only give an overview here.

user/song   song i   song j   song k   song l
User A      1                 1        recommend
User B               1
User C      1                 1        1
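
The UserBase idea above can be sketched in a few lines. This is a minimal illustration, not the full method with user scores: the `likes` dictionary mirrors the table, with the assumption that user B's single entry is song j, and the function names are hypothetical.

```python
import math

# User -> set of liked songs, mirroring the table above.
# Assumption: user B's single entry belongs to song j.
likes = {
    "A": {"i", "k"},
    "B": {"j"},
    "C": {"i", "k", "l"},
}

def user_similarity(u, v):
    """Co-occurrence similarity between two users over their liked songs."""
    common = len(likes[u] & likes[v])
    return common / math.sqrt(len(likes[u]) * len(likes[v]))

def recommend_user_based(target):
    """Score songs the target has not sung by the similarity of users who like them."""
    scores = {}
    for other in likes:
        if other == target:
            continue
        sim = user_similarity(target, other)
        if sim == 0:
            continue  # users with nothing in common contribute nothing
        for song in likes[other] - likes[target]:
            scores[song] = scores.get(song, 0.0) + sim
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

recs = recommend_user_based("A")
print(recs)
```

User A shares two songs with user C and none with user B, so the only candidate is song l, contributed by C.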

ItemBase

ItemBase is similar to UserBase, except that similarity is computed over the song matrix: find similar songs, then recommend them according to the user's history. The general principle is as follows. In the table below, songs i and k are both liked by users A and B, so i and k are similar. If user C likes song i, he will probably like the similar song k as well.

user/song   song i   song j   song k
User A      1                 1
User B      1        1        1
User C      1                 recommend
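
The same table can be worked through from the ItemBase side. The sketch below is a minimal illustration with hypothetical names (`fans`, `item_similarity`, `recommend_item_based`); it inverts the user-to-song data into song-to-user sets and scores unsung songs by their similarity to the songs the user already likes.

```python
import math
from collections import defaultdict

# User -> set of liked songs, mirroring the table above
likes = {
    "A": {"i", "k"},
    "B": {"i", "j", "k"},
    "C": {"i"},
}

# Invert to song -> set of users who like it
fans = defaultdict(set)
for user, songs in likes.items():
    for song in songs:
        fans[song].add(user)

def item_similarity(a, b):
    """Co-occurrence similarity between two songs."""
    return len(fans[a] & fans[b]) / math.sqrt(len(fans[a]) * len(fans[b]))

def recommend_item_based(user):
    """Rank songs the user has not sung by similarity to the songs they have."""
    scores = defaultdict(float)
    for liked in likes[user]:
        for candidate in list(fans):
            if candidate in likes[user]:
                continue
            scores[candidate] += item_similarity(liked, candidate)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

recs = recommend_item_based("C")
print(recs)
```

Song k ranks first for user C because two of the three users who like song i also like song k, while only one of them likes song j.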

This article uses ItemBase.

Algorithm implementation

Build the user-song one-hot matrix

  • Deduplicate the songs and sort them by title
  • Build the conversion dictionaries between song and index

Calculate the song-to-song co-occurrence matrix

  • Count the users who co-occur for each pair of songs (the co-occurrence matrix)

  • Count the occurrences of each single song

  • Apply the similarity formula to get the co-occurrence rate for each pair of songs

Recommend

If the user likes song i, look up the songs that have a non-zero co-occurrence rate with i.

The recommended songs are those songs k, ranked by co-occurrence rate.

Code

Retrieve data

import elasticsearch
import elasticsearch.helpers
import re
import numpy as np
import operator

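def trim_song_name(song_name):
    """
    Clean up a song title: strip whitespace and remove bracketed annotations.
    """
    song_name = song_name.strip()
    song_name = re.sub(r"-?【.*?】", "", song_name)
    # Remove full-width and half-width parenthesized text. The brackets must be
    # matched literally; an unescaped "(.*?)" is only an empty capture group.
    song_name = re.sub(r"-?（.*?）", "", song_name)
    song_name = re.sub(r"-?\(.*?\)", "", song_name)
    return song_name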

def get_data(size=0):
    """
    Return a dict mapping uid => list of song titles.
    size limits the number of documents read; 0 means no limit.
    """
    cur_size = 0
    ret = {}

    es_client = elasticsearch.Elasticsearch()
    search_result = elasticsearch.helpers.scan(
        es_client,
        index="ktv_works",
        doc_type="ktv_works",
        scroll="10m",
        query={}
    )

    for hit_item in search_result:
        cur_size += 1
        if size > 0 and cur_size > size:
            break

        item = hit_item['_source']
        work_list = item['item_list']
        # One cleaned-up title per work sung by this user
        ret[item['uid']] = [trim_song_name(work['songname']) for work in work_list]

    return ret

def get_uniq_song_sort_list(song_dict):
    """
    Merge duplicate songs and sort by song title.
    """
    return sorted({song for songs in song_dict.values() for song in songs})
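
The later code relies on `song_count`, `song_to_int`, `int_to_song` and `user_song_one_hot_matrix`, whose construction is not shown. The sketch below is one possible way to build them from the uid => song list dictionary. The sample `song_dict` is hypothetical and stands in for the output of `get_data()`, and ordering the rows by sorted uid is an assumption.

```python
import numpy as np

# Hypothetical stand-in for the dictionary returned by get_data()
song_dict = {
    "u1": ["十年", "后来"],
    "u2": ["十年", "小幸运", "后来"],
    "u3": ["小幸运"],
}

# Deduplicated, sorted song list and the two conversion dictionaries
song_list = sorted({song for songs in song_dict.values() for song in songs})
song_count = len(song_list)
song_to_int = {song: idx for idx, song in enumerate(song_list)}
int_to_song = {idx: song for song, idx in song_to_int.items()}

# One row per user, one column per song; 1 means the user has sung the song
user_list = sorted(song_dict)  # assumption: rows follow sorted uid order
user_song_one_hot_matrix = np.zeros((len(user_list), song_count))
for row, uid in enumerate(user_list):
    for song in song_dict[uid]:
        user_song_one_hot_matrix[row, song_to_int[song]] = 1

print(user_song_one_hot_matrix.shape)
```

Repeated performances of the same song by one user still produce a single 1, so the matrix records whether a user sang a song, not how often.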
    

Similarity calculation

import math

# Co-occurrence count matrix
col_show_count_matrix = np.zeros((song_count, song_count))
for i in range(song_count):
    for j in range(song_count):
        if i > j:  # the matrix is symmetric, so compute only one half
            # Indicator vector that selects songs i and j
            one_trik_matrix = np.zeros(song_count)
            one_trik_matrix[i] = 1
            one_trik_matrix[j] = 1

            # A user's dot product is 2 exactly when they have sung both songs
            ret_m = user_song_one_hot_matrix.dot(one_trik_matrix.T)
            col_show_value = len([ix for ix in ret_m if ix == 2])
            col_show_count_matrix[i, j] = col_show_value
            col_show_count_matrix[j, i] = col_show_value

# Similarity matrix
col_show_rate_matrix = np.zeros((song_count, song_count))

# Per-song user counts, N(i)
song_count_matrix = np.zeros(song_count)
for i in range(song_count):
    song_col = user_song_one_hot_matrix[:, i]
    song_count_matrix[i] = len([ix for ix in song_col if ix >= 1])

# Fill in the similarity matrix
for i in range(song_count):
    for j in range(song_count):
        if i > j:  # the matrix is symmetric, so compute only one half
            # Similarity: |N(i) ∩ N(j)| / sqrt(|N(i)| * |N(j)|)
            rate_value = col_show_count_matrix[i, j] / math.sqrt(song_count_matrix[i] * song_count_matrix[j])
            col_show_rate_matrix[i, j] = rate_value
            col_show_rate_matrix[j, i] = rate_value
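
The nested loops above get slow as the number of songs grows. As an alternative, the same quantities can be obtained with matrix operations. The sketch below is an assumption-laden illustration, not the author's code: it uses a small hypothetical one-hot matrix and assumes every song has at least one user, so that no division by zero occurs.

```python
import numpy as np

# Hypothetical 3-user x 3-song one-hot matrix (rows: users, columns: songs)
user_song_one_hot_matrix = np.array([
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
], dtype=float)

# Entry (i, j) of M^T M counts the users who have sung both song i and song j
col_show_count_matrix = user_song_one_hot_matrix.T @ user_song_one_hot_matrix

# The diagonal holds the per-song user counts N(i)
song_count_matrix = np.diag(col_show_count_matrix).copy()

# Keep the diagonal at zero, as the loop version does
np.fill_diagonal(col_show_count_matrix, 0)

# Divide each pair count by sqrt(N(i) * N(j))
col_show_rate_matrix = col_show_count_matrix / np.sqrt(
    np.outer(song_count_matrix, song_count_matrix)
)

print(col_show_rate_matrix.round(4))
```

For a 0/1 matrix this gives the same similarity values as the loop version, with one matrix product in place of one dot product per pair of songs.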

Recommend

def get_songs_from_recommand(col_recommand_matrix):
    """Return (song title, score) pairs for every song with a positive score."""
    return [(int_to_song[k], r_value) for k, r_value in enumerate(col_recommand_matrix) if r_value > 0]

input_song = "十年"
# Build the one-hot vector for the input song
one_trik_matrix = np.zeros(song_count)
one_trik_matrix[song_to_int[input_song]] = 1

# Multiplying by the similarity matrix selects the input song's similarity column
col_recommand_matrix = col_show_rate_matrix.dot(one_trik_matrix.T)
recommand_array = get_songs_from_recommand(col_recommand_matrix)
sorted_x = sorted(recommand_array, key=lambda k: k[1], reverse=True)

# Print the recommendations, highest score first
print(sorted_x)
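
The code above recommends from a single input song. Since ItemBase recommends according to the user's history, a natural extension is to put a 1 for every song the user has sung and to drop songs already in that history. The sketch below illustrates this under stated assumptions: the similarity values, the song names and the function `recommend_for_history` are all hypothetical.

```python
import numpy as np

# Hypothetical mapping and similarity matrix for three songs
int_to_song = {0: "song i", 1: "song j", 2: "song k"}
song_to_int = {name: idx for idx, name in int_to_song.items()}
col_show_rate_matrix = np.array([
    [0.0, 0.58, 0.82],
    [0.58, 0.0, 0.71],
    [0.82, 0.71, 0.0],
])

def recommend_for_history(history, top_n=10):
    """Sum the similarity columns of all sung songs and rank the unsung ones."""
    history_vector = np.zeros(len(int_to_song))
    for song in history:
        history_vector[song_to_int[song]] = 1
    scores = col_show_rate_matrix.dot(history_vector)
    scores[history_vector > 0] = 0  # never recommend a song already sung
    ranked = np.argsort(-scores)    # indices by descending score
    return [(int_to_song[int(k)], float(scores[k])) for k in ranked[:top_n] if scores[k] > 0]

recs = recommend_for_history(["song i", "song j"])
print(recs)
```

A song that is similar to several songs in the history accumulates a higher score than one that is similar to only a single song.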


Result

[('Three lives and three worlds', 0.5773502691896258), ('See you at the next intersection', 0.5773502691896258), ('Love without breaking up', 0.5773502691896258),...]

